Very large strings like chromosome sequences receive special handling
in Bioconductor. We use a general container class called
BString for “big” strings, which are distinguished from R
character vectors in that BStrings (a) obey different rules for copying
and (b) do not contain multiple strings (see the man page for BString).
The classes DNAString and AAString restrict the characters
that their instances may contain.
library(Biostrings)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
bdemo = BString("BCDEF")
ddemo = try(DNAString("BCDEF"))
## Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
## key 69 (char 'E') not in lookup table
cat(ddemo)
## Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
## key 69 (char 'E') not in lookup table
ademo = try(AAString("BCDEF"))
This succeeds without error: B, C, D, E, and F are all valid amino acid codes.
Efficient management of multiple strings employs classes with the suffix “Set”.
ddem2 = DNAStringSet(c("ACTG", "GTCAG"))
ddem2
## DNAStringSet object of length 2:
## width seq
## [1] 4 ACTG
## [2] 5 GTCAG
The restrictions on contents of genomic strings are defined in
constant vectors in Biostrings. For example
AA_ALPHABET
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+" "."
IUPAC_CODE_MAP
## A C G T M R W S Y K V
## "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
## H D B N
## "ACT" "AGT" "CGT" "ACGT"
There are over 200 functions defined in the Biostrings package, all devoted to computation on sequence data. Here’s an example illustrating basic notions.
D = DNAString("ACTGACGTACGTAGGCTAGCGATCGATATACGATATACG")
translate(D)
## 13-letter AAString object
## seq: TDVRRLAIDIRYT
codons(D)
## Views on a 39-letter DNAString subject
## subject: ACTGACGTACGTAGGCTAGCGATCGATATACGATATACG
## views:
## start end width
## [1] 1 3 3 [ACT]
## [2] 4 6 3 [GAC]
## [3] 7 9 3 [GTA]
## [4] 10 12 3 [CGT]
## [5] 13 15 3 [AGG]
## ... ... ... ... ...
## [9] 25 27 3 [GAT]
## [10] 28 30 3 [ATA]
## [11] 31 33 3 [CGA]
## [12] 34 36 3 [TAT]
## [13] 37 39 3 [ACG]
Notice that the output of codons is printed as a Views
instance. This is a very efficient approach to creating references to
subsequences of a sequence, without copying any data.
The statistical tests we have covered up to now leave out a substantial portion of life science projects: data that are binary, categorical, or ordinal. To give a very specific example, consider genetic data where you have two groups of genotypes (AA/Aa or aa) for cases and controls of a given disease. The statistical question is whether genotype and disease are associated. As in the examples we have been studying previously, we have two populations (AA/Aa and aa) and numeric data for each, since disease status can be coded as 0 or 1. So why can’t we perform a t-test? Note that the data are either 0 (controls) or 1 (cases). Clearly, these data are not normally distributed, so the t-distribution approximation is out of the question. We could use the CLT if the sample size is large enough; otherwise, we can use association tests.
One of the most famous examples of hypothesis testing was performed by R.A. Fisher. An acquaintance of Fisher’s claimed that she could tell if milk was added before or after tea was poured. Fisher gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. Say she picked 3 out of 4 correctly, do we believe she has a special ability? Hypothesis testing helps answer this question by quantifying what happens by chance. This example is called the “Lady Tasting Tea” experiment (and, as it turns out, Fisher’s friend was a scientist herself, Muriel Bristol).
The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing 4 of each. If we assume this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 green (correct answer) and 4 red (incorrect answer) balls.
Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinatorics to figure out each probability. The probability of picking 3 is \({4 \choose 3} {4 \choose 1} / {8 \choose 4} = 16/70\). The probability of picking all 4 correct is \({4 \choose 4} {4 \choose 0}/{8 \choose 4}= 1/70\). Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is \(\approx 0.24\). This is the p-value. The procedure that produced this p-value is called Fisher’s exact test and it uses the hypergeometric distribution.
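The arithmetic above can be checked directly in base R, either with choose() or with the hypergeometric distribution function dhyper(); nothing here is new data, it simply re-derives the 17/70 tail probability:

```r
# P(exactly 3 correct): choose(4,3)*choose(4,1)/choose(8,4)
p3 <- choose(4, 3) * choose(4, 1) / choose(8, 4)   # 16/70
# P(all 4 correct)
p4 <- choose(4, 4) * choose(4, 0) / choose(8, 4)   # 1/70
p3 + p4                                            # 17/70, about 0.243

# same tail probability via the hypergeometric distribution:
# 4 "green" (correct) balls, 4 "red", 4 drawn; P(X >= 3)
sum(dhyper(3:4, m = 4, n = 4, k = 4))
```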
The data from the experiment above can be summarized by a two by two table:
tab <- matrix(c(3,1,1,3),2,2)
rownames(tab)<-c("Poured Before","Poured After")
colnames(tab)<-c("Guessed before","Guessed after")
tab
## Guessed before Guessed after
## Poured Before 3 1
## Poured After 1 3
The function fisher.test performs the calculations above. The
results can be obtained like this:
fisher.test(tab,alternative="greater")
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
Genome-wide association studies (GWAS) have become ubiquitous in biology. One of the main statistical summaries used in these studies are Manhattan plots. The y-axis of a Manhattan plot typically represents the negative of log (base 10) of the p-values obtained for association tests applied at millions of single nucleotide polymorphisms (SNP). The x-axis is typically organized by chromosome (chromosome 1 to 22, X, Y, etc.). These p-values are obtained in a similar way to the test performed on the tea taster. However, in that example the number of green and red balls is experimentally fixed and the number of answers given for each category is also fixed. Another way to say this is that the sum of the rows and the sum of the columns are fixed. This defines constraints on the possible ways we can fill the two by two table and also permits us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.
Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (aa) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?
Let’s create a dataset with these percentages:
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",200),rep("aa",50)),
levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up
head(dat)
## disease genotype
## 54 control AA/Aa
## 202 control aa
## 216 control aa
## 19 control AA/Aa
## 41 control AA/Aa
## 55 control AA/Aa
To create the appropriate two by two table, we will use the function
table. This function tabulates the frequency of each level
in a factor. For example:
table(genotype)
## genotype
## AA/Aa aa
## 200 50
table(disease)
## disease
## control cases
## 220 30
If you provide the function with two factors, it will tabulate all possible pairs and thus create the two by two table:
tab <- table(genotype,disease)
tab
## disease
## genotype control cases
## AA/Aa 180 20
## aa 40 10
Note that you can feed table \(n\) factors and it will tabulate all \(n\)-way tables.
The typical statistic used to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are an “aa”: 10/40, the odds of having the disease if you are an “AA/Aa”: 20/180, and take the ratio: \((10/40) / (20/180)\)
(tab[2,2]/tab[2,1]) / (tab[1,2]/tab[1,1])
## [1] 2.25
To compute a p-value, we don’t use the OR directly. We instead assume that there is no association between genotype and disease, and then compute what we expect to see in each cell of the table (note: this use of the word “cell” refers to elements in a matrix or table and has nothing to do with biological cells). Under the null hypothesis, the group with 200 individuals and the group with 50 individuals were each randomly assigned the disease with the same probability. If this is the case, then the probability of disease is:
p=mean(disease=="cases")
p
## [1] 0.12
The expected table is therefore:
expected <- rbind(c(1-p,p)*sum(genotype=="AA/Aa"),
c(1-p,p)*sum(genotype=="aa"))
dimnames(expected)<-dimnames(tab)
expected
## disease
## genotype control cases
## AA/Aa 176 24
## aa 44 6
The Chi-square test uses an asymptotic result (similar to the CLT) related to the sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The p-value for this table is:
chisq.test(tab)$p.value
## [1] 0.08857435
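The statistic behind this p-value can be re-derived by hand from the observed and expected tables. One caveat worth knowing: for 2-by-2 tables, chisq.test applies the Yates continuity correction by default, so to match the hand computation you must set correct=FALSE (the counts below are the same ones used above):

```r
tab <- matrix(c(180, 40, 20, 10), 2, 2,
              dimnames = list(genotype = c("AA/Aa", "aa"),
                              disease = c("control", "cases")))
p <- 0.12                                      # overall disease probability
expected <- outer(rowSums(tab), c(1 - p, p))   # expected counts under independence
chistat <- sum((tab - expected)^2 / expected)  # chi-squared statistic
chistat
pchisq(chistat, df = 1, lower.tail = FALSE)    # uncorrected p-value
chisq.test(tab, correct = FALSE)$statistic     # agrees with the hand computation
```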
As mentioned earlier, reporting only p-values is not an appropriate way to report the results of your experiment. Many genetic association studies seem to overemphasize p-values: they have large sample sizes and report impressively small p-values, yet when one looks closely at the results, the odds ratios are quite modest, barely bigger than 1. In this case, having genotype AA/Aa versus aa might not change an individual’s risk for the disease by a practically significant amount; one might not change one’s behavior based on such a small increase in risk.
There is not a one-to-one relationship between the odds ratio and the p-value. To demonstrate, we recalculate the p-value keeping all the proportions identical but increasing the sample size by a factor of 10, which reduces the p-value substantially (as we saw with the t-test under the alternative hypothesis):
tab<-tab*10
chisq.test(tab)$p.value
## [1] 1.219624e-09
Computing confidence intervals for the OR is not mathematically straightforward. Unlike other statistics, for which we can derive useful approximations of their distributions, the OR is not only a ratio, but a ratio of ratios. Therefore, there is no simple way of using, for example, the CLT.
One approach is to use the theory of generalized linear models which provides estimates of the log odds ratio, rather than the OR itself, that can be shown to be asymptotically normal. Here we provide R code without presenting the theoretical details (for further details please see a reference on generalized linear models such as Wikipedia or McCullagh and Nelder, 1989):
fit <- glm(disease~genotype,family="binomial",data=dat)
coeftab<- summary(fit)$coef
coeftab
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1972246 0.2356828 -9.322803 1.133070e-20
## genotypeaa 0.8109302 0.4249074 1.908487 5.632834e-02
The second row of the table shown above gives you the estimate and SE of the log odds ratio. Mathematical theory tells us that this estimate is approximately normally distributed. We can therefore form a confidence interval and then exponentiate to provide a confidence interval for the OR.
ci <- coeftab[2,1] + c(-2,2)*coeftab[2,2]
exp(ci)
## [1] 0.9618616 5.2632310
The confidence interval includes 1, which is consistent with the p-value being bigger than 0.05. Note that the p-value shown here is based on a different approximation from the one used by the Chi-square test, which is why they differ.
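The z value and p-value in that coefficient table are Wald statistics computed from the estimate and its standard error, and the interval above is the same calculation done by hand. Using the printed numbers (copied from the output above, so treat them as illustrative):

```r
est <- 0.8109302   # log odds ratio estimate (from the summary table)
se  <- 0.4249074   # its standard error
est / se                       # z value: about 1.91
2 * pnorm(-abs(est / se))      # two-sided p-value: about 0.056
exp(est + c(-2, 2) * se)       # approximate 95% CI for the OR
```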
The Biostrings package contains object classes for representing DNA, RNA and amino acid sequences. It also contains functions for operating on those objects. We will focus on DNA sequences for now. DNA sequences are represented by DNAString objects. There are also RNAString and AAString objects with similar properties for representing RNA and protein, and many of the functions shown here work for all kinds of Biostrings.
library(Biostrings)
This command defines a DNAString:
# define a DNAString
dna <- DNAString("TCGAGCAAT")
dna
## 9-letter DNAString object
## seq: TCGAGCAAT
You can measure the length of a DNAString:
length(dna) # number of bases in a DNAString
## [1] 9
Not all characters are allowed in DNAString objects. Defining a DNAString with invalid characters will produce an error:
DNAString("JQX") # error - invalid bases
In practice, the most common characters in DNAStrings are the four bases (ACGT), the wild card or unknown base (N), and the dash representing a gap (-). Letters from the extended IUPAC genetic code are allowed, but they are rare in sequencing data.
DNAString("NNNACGCGC-TTA-CGGGCTANN") # valid sequence with unknowns and gaps
## 23-letter DNAString object
## seq: NNNACGCGC-TTA-CGGGCTANN
You can index into a DNAString with [ to
extract a substring:
dna[4:6] # extract a substring
## 3-letter DNAString object
## seq: AGC
Sometimes you may need to convert DNAStrings back into
character strings for manipulation in R. You can do this with
as.character:
as.character(dna) # convert DNAString to character
## [1] "TCGAGCAAT"
It is often convenient to analyze multiple DNA sequences at once. You can combine multiple sequences into a single object as a DNAStringSet:
set1 <- DNAStringSet(c("TCA", "AAATCG", "ACGTGCCTA", "CGCGCA", "GTT", "TCA")) # define a DNAStringSet
set1
## DNAStringSet object of length 6:
## width seq
## [1] 3 TCA
## [2] 6 AAATCG
## [3] 9 ACGTGCCTA
## [4] 6 CGCGCA
## [5] 3 GTT
## [6] 3 TCA
Note that indexing a DNAStringSet with [
extracts whole sequences from the set, not subsequences:
set1[2:3] # extract subset of sequences
## DNAStringSet object of length 2:
## width seq
## [1] 6 AAATCG
## [2] 9 ACGTGCCTA
Use double brackets [[ to extract single
DNAStrings from a DNAStringSet:
set1[[4]] # extract one sequence as a single DNAString
## 6-letter DNAString object
## seq: CGCGCA
Using length() on a DNAStringSet returns the
length of the set, not the size of each sequence:
length(set1) # number of DNAstrings in set
## [1] 6
The width() function returns the size of each individual
sequence in a DNAStringSet:
width(set1) # size of each DNAString
## [1] 3 6 9 6 3 3
You can detect duplicated DNA sequences with
duplicated() and keep only unique sequences with
unique().
duplicated(set1) # detect which sequences are duplicated
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
unique(set1) # keep only unique sequences
## DNAStringSet object of length 5:
## width seq
## [1] 3 TCA
## [2] 6 AAATCG
## [3] 9 ACGTGCCTA
## [4] 6 CGCGCA
## [5] 3 GTT
You can also sort() sequences alphabetically.
sort(set1)
## DNAStringSet object of length 6:
## width seq
## [1] 6 AAATCG
## [2] 9 ACGTGCCTA
## [3] 6 CGCGCA
## [4] 3 GTT
## [5] 3 TCA
## [6] 3 TCA
Consider the DNA sequence ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC.
dna_seq <- DNAString("ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC")
letterFrequency() counts the number of times a specific
letter appears in a Biostring. This command can also be used on
other Biostrings types, not only DNAStrings.
letterFrequency(dna_seq, "A") # count A in sequence
## A
## 12
If multiple letters are given to letterFrequency(), it
counts the combined number of times those letters appear. For example,
this command counts the frequency of G or C.
letterFrequency(dna_seq, "GC") # count G or C in sequence
## G|C
## 22
You can also determine the frequency of all dinucleotides or trinucleotides in sliding windows across the sequence. For example, the dinucleotide CG appears 5 times and the trinucleotide TTT appears 2 times.
dinucleotideFrequency(dna_seq) # frequencies of all dinucleotides
## AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## 6 3 0 3 1 1 5 5 0 5 1 3 4 4 3 3
trinucleotideFrequency(dna_seq) # frequencies of all trinucleotides
## AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT
## 5 1 0 0 0 1 1 1 0 0 0 0 0 2 1 0 0 0 0 1
## CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT
## 1 0 0 0 0 4 1 0 3 1 0 1 0 0 0 0 0 0 3 2
## GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT
## 0 1 0 0 0 1 2 0 1 2 0 1 0 0 1 2 0 0 0 3
## TTA TTC TTG TTT
## 1 0 0 2
You can find the reverse complement of a DNAString in a
single step with reverseComplement():
reverseComplement(dna_seq) # find reverse complement
## 48-letter DNAString object
## seq: GATAGACACACATGGTAGTAGCGTTTTTTTAAAAGAGCCGCGCGCGAT
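reverseComplement() only works on Biostrings objects. As an illustration of what it does (not a replacement for it), here is a minimal base-R sketch for plain character strings containing only A, C, G, and T; revcomp is a hypothetical helper name:

```r
# minimal sketch: reverse complement of a plain ACGT character string
revcomp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)                    # complement each base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")   # reverse the order
}
revcomp("ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC")
```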
You can also find the amino acid translation of a DNAString
in a single step with translate():
translate(dna_seq) # amino acid translation
## 16-letter AAString object
## seq: IARGSFKKTLLPCVSI
Two common tasks with Biostrings are to count the number of occurrences of a pattern and find the location of those patterns within a Biostring or set of Biostrings. We will consider these commands on DNAStrings as an example, but they work similarly for all kinds of Biostrings.
dna_seq <- DNAString("ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC")
dna_seq
## 48-letter DNAString object
## seq: ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC
countPattern() takes two arguments: a pattern and a
Biostring. It returns the number of times that pattern appears
within the Biostring.
countPattern("CG", dna_seq) # pattern "CG" occurs 5 times
## [1] 5
matchPattern() also takes a pattern and a
Biostring as arguments. It returns the locations of each
pattern occurrence within the Biostring.
matchPattern("CG", dna_seq) # locations of pattern "CG"
## Views on a 48-letter DNAString subject
## subject: ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC
## views:
## start end width
## [1] 3 4 2 [CG]
## [2] 5 6 2 [CG]
## [3] 7 8 2 [CG]
## [4] 9 10 2 [CG]
## [5] 26 27 2 [CG]
matchPattern() returns a Views object, which is similar
to an IRanges. For example, you can find the start location of each
pattern occurrence with start().
start(matchPattern("CG", dna_seq)) # start locations of the pattern
## [1] 3 5 7 9 26
You can count or match patterns of any length.
matchPattern("CTCTTTTAAAAAAACGCTACTACCATGTGT", dna_seq) # match patterns of any length
## Views on a 48-letter DNAString subject
## subject: ATCGCGCGCGGCTCTTTTAAAAAAACGCTACTACCATGTGTGTCTATC
## views:
## start end width
## [1] 12 41 30 [CTCTTTTAAAAAAACGCTACTACCATGTGT]
Because DNA is double-stranded, sometimes you might want to consider whether the reverse complement of a pattern matches your string. For example, this can help you determine whether a primer binds to a certain DNA sequence. Combine DNAString operations to achieve this.
# check for pattern and its reverse complement
countPattern("TAG", dna_seq)
## [1] 0
countPattern(reverseComplement(DNAString("TAG")), dna_seq)
## [1] 3
You can also count and locate patterns in DNAStringSet objects, but the commands and result formats are slightly different.
set2 <- DNAStringSet(c("AACCGGTTTCGA", "CATGCTGCTACA", "CGATCGCGCCGG", "TACAACCGTACA"))
set2
## DNAStringSet object of length 4:
## width seq
## [1] 12 AACCGGTTTCGA
## [2] 12 CATGCTGCTACA
## [3] 12 CGATCGCGCCGG
## [4] 12 TACAACCGTACA
vcountPattern() counts the number of occurrences of a
single pattern across many Biostrings. It returns a vector
listing the number of occurrences in each element of a string set.
vcountPattern("CG", set2) # CG counts for entire DNAStringSet
## [1] 2 0 4 1
vmatchPattern() returns the locations of a pattern
across many Biostrings. The results are formatted similarly to
a list of IRanges.
vmatchPattern("CG", set2)
## MIndex object of length 4
## [[1]]
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 4 5 2
## [2] 10 11 2
##
## [[2]]
## IRanges object with 0 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
##
## [[3]]
## IRanges object with 4 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1 2 2
## [2] 5 6 2
## [3] 7 8 2
## [4] 10 11 2
##
## [[4]]
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 7 8 2
To work with matches from a single string in a set, index into the
vmatchPattern() results with double brackets
[[.
vmatchPattern("CG", set2)[[1]] # access matches for the first element of the DNAStringSet
## IRanges object with 2 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 4 5 2
## [2] 10 11 2
Let’s use our data to see how well the central limit theorem approximates sample averages from our data. We will leverage our entire population dataset to compare the results we obtain by actually sampling from the distribution to what the CLT predicts.
dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
## Sex Diet Bodyweight
## 1 F hf 31.94
## 2 F hf 32.48
## 3 F hf 22.82
## 4 F hf 19.92
## 5 F hf 32.22
## 6 F hf 27.50
Start by selecting only female mice, since males and females have different weights. We will then take samples of various sizes from each population.
library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
We can compute the population parameters of interest using the mean function.
mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517
We can also compute the population standard deviation of, say, a vector
\(x\). However, we do not use
the R function sd, because it does not
compute the population standard deviation \(\sigma_x\). Instead, sd
assumes its main argument is a random sample, say \(X\), and returns the sample standard deviation \(s_X\), an estimate of \(\sigma_x\). The two
differ because one divides by the sample
size and the other by the sample size minus one. We can see that with R
code:
x <- controlPopulation
N <- length(x)
populationvar <- mean((x-mean(x))^2)
identical(var(x), populationvar)
## [1] FALSE
identical(var(x)*(N-1)/N, populationvar)
## [1] TRUE
So, to be mathematically correct, we do not use sd or
var. Instead, we use the popvar and
popsd functions in rafalib:
library(rafalib)
sd_hf <- popsd(hfPopulation)
sd_control <- popsd(controlPopulation)
Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we want to estimate them from samples.
N <- 12
hf <- sample(hfPopulation, N)
control <- sample(controlPopulation, N)
As we described, the CLT tells us that for large \(N\), each of these sample averages is approximately normal, centered at the population mean and with standard error equal to the population standard deviation divided by \(\sqrt{N}\). We mentioned that a rule of thumb is that \(N\) should be 30 or more. However, that is just a rule of thumb, since the preciseness of the approximation depends on the population distribution. Here we can actually check the approximation, and we do that for various values of \(N\).
Now we use sapply and replicate instead of
for loops, which makes for cleaner code (we do not have to
pre-allocate a vector, R takes care of this for us):
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
res <- sapply(Ns,function(n) {
replicate(B,mean(sample(hfPopulation,n))-mean(sample(controlPopulation,n)))
})
Now we can use qq-plots to see how well the CLT approximation works for these. If the normal distribution is in fact a good approximation, the points should fall on a straight line when compared to normal quantiles. The more they deviate, the worse the approximation. In the title, we also show the average and SD of the observed distribution, which demonstrates how the SD decreases as \(1/\sqrt{N}\), as predicted.
mypar(2,2)
for (i in seq(along=Ns)) {
titleavg <- signif(mean(res[,i]),3)
titlesd <- signif(popsd(res[,i]),3)
title <- paste0("N=",Ns[i]," Avg=",titleavg," SD=",titlesd)
qqnorm(res[,i],main=title)
qqline(res[,i],col=2)
}
Quantile versus quantile plot of simulated differences versus theoretical normal distribution for four different sample sizes.
Here we see a pretty good fit even for \(N=3\). Why is this? Because the population itself is relatively close to normally distributed, the averages are close to normal as well (a sum of normals is also normal). In practice, we actually calculate a ratio: we divide by the estimated standard deviation. Here is where the sample size starts to matter more.
Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
##function to compute a t-stat
computetstat <- function(n) {
y <- sample(hfPopulation,n)
x <- sample(controlPopulation,n)
(mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
}
res <- sapply(Ns,function(n) {
replicate(B,computetstat(n))
})
mypar(2,2)
for (i in seq(along=Ns)) {
qqnorm(res[,i],main=Ns[i])
qqline(res[,i],col=2)
}
Quantile versus quantile plot of simulated ratios versus theoretical normal distribution for four different sample sizes.
So we see that for \(N=3\), the CLT does not provide a usable approximation. For \(N=12\), there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is spot on.
This simulation only proves that \(N=12\) is large enough in this case, not in general. As mentioned above, we will not be able to perform this simulation in most situations. We only use the simulation to illustrate the concepts behind the CLT and its limitations. In future sections, we will describe the approaches we actually use in practice.
Below we will discuss the Central Limit Theorem (CLT) and the t-distribution, both of which help us make important calculations related to probabilities. Both are frequently used in science to test statistical hypotheses. Each requires us to make different assumptions, but if those assumptions hold, we can calculate exact probabilities of events through the use of mathematical formulas.
The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{Y}\) of a random sample follows a normal distribution centered at the population average \(\mu_Y\) and with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.
Please note that if we subtract a constant from a random variable, the mean of the new random variable shifts by that constant. Mathematically, if \(X\) is a random variable with mean \(\mu\) and \(a\) is a constant, the mean of \(X - a\) is \(\mu-a\). A similarly intuitive result holds for multiplication and the standard deviation (SD). If \(X\) is a random variable with mean \(\mu\) and SD \(\sigma\), and \(a\) is a constant, then the mean and SD of \(aX\) are \(a \mu\) and \(\mid a \mid \sigma\) respectively. To see how intuitive this is, imagine that we subtract 10 grams from each of the mice weights. The average weight should also drop by that much. Similarly, if we change the units from grams to milligrams by multiplying by 1000, then the spread of the numbers becomes larger.
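These two rules are easy to verify numerically in R (the numbers below are arbitrary illustrations, not data from the mouse study):

```r
x <- c(21, 24, 30, 28, 25)   # arbitrary "weights" in grams
a <- 10

# subtracting a constant shifts the mean by that constant
all.equal(mean(x - a), mean(x) - a)

# multiplying by a constant scales both the mean and the SD
all.equal(mean(1000 * x), 1000 * mean(x))
all.equal(sd(1000 * x), 1000 * sd(x))
```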
This implies that if we take many samples of size \(N\), then the quantity:
\[ \frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}} \]
is approximated with a normal distribution centered at 0 and with standard deviation 1.
Now we are interested in the difference between two sample averages. Here again a mathematical result helps. If we have two random variables \(X\) and \(Y\) with means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\) respectively, then we have the following result: the mean of the sum \(Y + X\) is the sum of the means \(\mu_Y + \mu_X\). Using one of the facts we mentioned earlier, \(Y - X\) is \(Y + aX\) with \(a = -1\), which implies that the mean of \(Y - X\) is \(\mu_Y - \mu_X\). This is intuitive. However, the next result is perhaps not as intuitive. If \(X\) and \(Y\) are independent of each other, as they are in our mouse example, then the variance (SD squared) of \(Y + X\) is the sum of the variances \(\sigma_Y^2 + \sigma_X^2\). This implies that the variance of the difference \(Y - X\) is the variance of \(Y + aX\) with \(a = -1\), which is \(\sigma^2_Y + a^2 \sigma_X^2 = \sigma^2_Y + \sigma_X^2\). So the variance of the difference is also the sum of the variances. If this seems like a counterintuitive result, remember that if \(X\) and \(Y\) are independent of each other, the sign does not really matter. It can be considered random: if \(X\) is normal with a certain variance, for example, so is \(-X\). Finally, another useful result is that the sum of normal variables is again normal.
All this math is very helpful for the purposes of our study because we have two sample averages and are interested in the difference. Because both are normal, the difference is normal as well, and the variance (the standard deviation squared) is the sum of the two variances. Under the null hypothesis that there is no difference between the population averages, the difference between the sample averages \(\bar{Y}-\bar{X}\), with \(\bar{X}\) and \(\bar{Y}\) the sample average for the two diets respectively, is approximated by a normal distribution centered at 0 (there is no difference) and with standard deviation \(\sqrt{\sigma_X^2 +\sigma_Y^2}/\sqrt{N}\).
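A quick simulation illustrates that the variance of a difference of independent variables is the sum of the variances. The variances 1 and 4 are arbitrary choices; with 100,000 draws the sample variance will be close to, not exactly, 5:

```r
set.seed(1)
x <- rnorm(100000, mean = 0, sd = 1)   # variance 1
y <- rnorm(100000, mean = 0, sd = 2)   # variance 4
var(y - x)                             # close to 1 + 4 = 5
var(y + x)                             # also close to 5
```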
This suggests that this ratio:
\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}} \]
is approximated by a normal distribution centered at 0 and standard deviation 1. Using this approximation makes computing p-values simple because we know the proportion of the distribution under any value. For example, only 5% of these values are larger than 2 (in absolute value):
pnorm(-2) + (1 - pnorm(2))
## [1] 0.04550026
We don’t need to buy more mice; 12 and 12 suffice.
However, we can’t claim victory just yet because we don’t know the population standard deviations: \(\sigma_X\) and \(\sigma_Y\). These are unknown population parameters, but we can get around this by using the sample standard deviations, call them \(s_X\) and \(s_Y\). These are defined as:
\[ s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2 \]
Note that we are dividing by \(M-1\) and \(N-1\), instead of by \(M\) and \(N\). There is a theoretical reason for doing this which we do not explain here. But to get an intuition, think of the case when you just have 2 numbers: the average distance to the mean is basically half the difference between the two numbers, so you really just have information from one number. This is somewhat of a minor point. The main point is that \(s_X\) and \(s_Y\) serve as estimates of \(\sigma_X\) and \(\sigma_Y\).
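The two-number intuition can be checked directly: with just two observations, R's var (which divides by \(N-1\)) returns half the squared difference between them, i.e. information from effectively one number:

```r
x <- c(3, 7)              # just two observations
var(x)                    # divides by N - 1 = 1, giving 8
(7 - 3)^2 / 2             # same value: half the squared difference
mean((x - mean(x))^2)     # the population variance divides by N instead, giving 4
```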
So we can redefine our ratio as
\[ \sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}} \]
if \(M=N\) or in general,
\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}} \]
The CLT tells us that when \(M\) and
\(N\) are large, this random variable
is normally distributed with mean 0 and SD 1. Thus we can compute
p-values using the function pnorm.
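As a sketch of this calculation (the weight vectors here are made up for illustration and stand in for the two diet samples):

```r
# hypothetical samples standing in for the two diets
X <- c(21.5, 28.1, 24.0, 23.5, 23.7, 19.8)  # control
Y <- c(26.0, 30.2, 25.9, 28.3, 27.1, 24.9)  # high-fat
M <- length(X); N <- length(Y)
# the ratio defined above, using sample SDs as estimates
tstat <- (mean(Y) - mean(X)) / sqrt(var(X)/M + var(Y)/N)
# two-sided p-value under the CLT normal approximation
pval <- 2 * (1 - pnorm(abs(tstat)))
```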
The CLT relies on large samples, what we refer to as asymptotic results. When the CLT does not apply, there is another option that does not rely on asymptotic results. When the original population from which a random variable, say \(Y\), is sampled is normally distributed with mean 0, then we can calculate the distribution of:
\[ \sqrt{N} \frac{\bar{Y}}{s_Y} \]
This is the ratio of two random variables so it is not necessarily normal. The fact that the denominator can be small by chance increases the probability of observing large values. William Sealy Gosset, an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym “Student”. The distribution is therefore called Student’s t-distribution. Later we will learn more about how this result is used.
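We can see the heavier tails of the t-distribution directly in R: the probability of exceeding 2 is larger under a t-distribution with few degrees of freedom than under the normal.

```r
# tail probability beyond 2 under the normal and under a t with 4 df
1 - pnorm(2)       # normal tail
1 - pt(2, df = 4)  # t tail, noticeably larger
```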
Here we will use the mice phenotype data as an example. We start by creating two vectors, one for the control population and one for the high-fat diet population:
library(dplyr)
dat <- read.csv("mice_pheno.csv") #We downloaded this file in a previous section
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
It is important to keep in mind that what we are assuming to be normal here is the distribution of \(y_1,y_2,\dots,y_n\), not the random variable \(\bar{Y}\). Although we can’t do this in practice, in this illustrative example, we get to see this distribution for both controls and high fat diet mice:
library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)
Histograms of all weights for both populations.
We can use qq-plots to confirm that the distributions are relatively close to being normally distributed. We will explore these plots in more depth in a later section, but the important thing to know is that it compares data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, then the data is close to the theoretical distribution.
mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)
Quantile-quantile plots of all weights for both populations.
The larger the sample, the more forgiving the result is to the weakness of this approximation. In the next section, we will see that for this particular dataset the t-distribution works well even for sample sizes as small as 3.
One distinguishing characteristic of high-throughput data is that although we want to report on specific features, we observe many related outcomes. For example, we measure the expression of thousands of genes, or the height of thousands of peaks representing protein binding, or the methylation levels across several CpGs. However, most of the statistical inference approaches we have shown here treat each feature independently and pretty much ignore data from other features. We will learn how using statistical models provides power by modeling features jointly. The most successful of these approaches are what we refer to as hierarchical models, which we explain below in the context of Bayesian statistics.
We start by reviewing Bayes theorem. We do this using a hypothetical cystic fibrosis test as an example. Suppose a test for cystic fibrosis has an accuracy of 99%. We will use the following notation:
\[ \mbox{Prob}(+ \mid D=1)=0.99, \mbox{Prob}(- \mid D=0)=0.99 \]
with \(+\) meaning a positive test and \(D\) representing if you actually have the disease (1) or not (0).
Suppose we select a random person and they test positive; what is the probability that they have the disease? We write this as \(\mbox{Prob}(D=1 \mid +)\). The cystic fibrosis rate is 1 in 3,900, which implies that \(\mbox{Prob}(D=1)=0.00025\). To answer this question we will use Bayes Theorem, which in general tells us that:
\[ \mbox{Pr}(A \mid B) = \frac{\mbox{Pr}(B \mid A)\mbox{Pr}(A)}{\mbox{Pr}(B)} \]
This equation applied to our problem becomes:
\[ \begin{align*} \mbox{Prob}(D=1 \mid +) & = \frac{ P(+ \mid D=1) \cdot P(D=1)} {\mbox{Prob}(+)} \\ & = \frac{\mbox{Prob}(+ \mid D=1)\cdot P(D=1)} {\mbox{Prob}(+ \mid D=1) \cdot P(D=1) + \mbox{Prob}(+ \mid D=0) \mbox{Prob}( D=0)} \end{align*} \]
Plugging in the numbers we get:
\[ \frac{0.99 \cdot 0.00025}{0.99 \cdot 0.00025 + 0.01 \cdot (.99975)} = 0.02 \]
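The same arithmetic in R:

```r
# Bayes theorem for the hypothetical cystic fibrosis test
accuracy <- 0.99
prevalence <- 0.00025
prob_pos <- accuracy * prevalence + (1 - accuracy) * (1 - prevalence)
prob_disease_given_pos <- accuracy * prevalence / prob_pos
round(prob_disease_given_pos, 2)  # 0.02
```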
This says that despite the test having 0.99 accuracy, the probability of having the disease given a positive test is only 0.02. This may appear counterintuitive to some. The reason is that we have to factor in the very low probability that a person chosen at random has the disease. To illustrate this, we run a Monte Carlo simulation.
The following simulation is meant to help you visualize Bayes Theorem. We start by randomly selecting 1500 people from a population in which the disease in question has a 5% prevalence.
set.seed(3)
prev <- 1/20
##Later, we will arrange the 1500 people in a 30 by 50 grid
M <- 50 ; N <- 30
##do they have the disease?
d<-rbinom(N*M,1,p=prev)
Now each person gets the test which is correct 90% of the time.
accuracy <- 0.9
test <- rep(NA,N*M)
##do cases test positive?
test[d==1] <- rbinom(sum(d==1), 1, p=accuracy)
##do controls test positive?
test[d==0] <- rbinom(sum(d==0), 1, p=1-accuracy)
Because there are so many more controls than cases, even with a low false positive rate, we get more controls than cases in the group that tested positive (code not shown):
Simulation demonstrating Bayes theorem. The top plot shows every individual, with red denoting cases. Each individual takes a test that gives the correct answer 90% of the time. Those testing positive (either correctly or incorrectly) are placed in the bottom left panel; those testing negative in the bottom right.
The proportions of red in the top plot shows \(\mbox{Pr}(D=1)\). The bottom left shows \(\mbox{Pr}(D=1 \mid +)\) and the bottom right shows \(\mbox{Pr}(D=0 \mid +)\).
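Rerunning the simulation, these conditional probabilities can be read off directly from the simulated vectors (a sketch; the exact numbers depend on the random seed):

```r
set.seed(3)
prev <- 1/20
M <- 50; N <- 30
d <- rbinom(N * M, 1, p = prev)  # disease status for 1500 people
accuracy <- 0.9
test <- rep(NA, N * M)
test[d == 1] <- rbinom(sum(d == 1), 1, p = accuracy)
test[d == 0] <- rbinom(sum(d == 0), 1, p = 1 - accuracy)
mean(d[test == 1])  # empirical Pr(D = 1 | +), well below the test accuracy
```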
José Iglesias is a professional baseball player. In April 2013, when he was starting his career, he was performing rather well:
| Month | At Bats | H | AVG |
|---|---|---|---|
| April | 20 | 9 | .450 |
The batting average (AVG) statistic is one way of
measuring success. Roughly speaking, it tells us the success rate when
batting. An AVG of .450 means José has been successful 45%
of the times he has batted (At Bats), which is rather high
as we will see. Note, for example, that no one has finished a season
with an AVG of .400 since Ted Williams did it in 1941! To
illustrate the way hierarchical models are powerful, we will try to
predict José’s batting average at the end of the season, after he has
gone to bat over 500 times.
With the techniques we have learned up to now, referred to as frequentist techniques, the best we can do is provide a confidence interval. We can think of outcomes from hitting as a binomial with a success rate of \(p\). So if the success rate is indeed .450, the standard error of just 20 at bats is:
\[ \sqrt{\frac{.450 (1-.450)}{20}}=.111 \]
This means that our confidence interval is .450-.222 to .450+.222 or .228 to .672.
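In R:

```r
# 95% confidence interval for the observed .450 average over 20 at bats
p_hat <- 0.450
se <- sqrt(p_hat * (1 - p_hat) / 20)
ci <- round(p_hat + c(-2, 2) * se, 3)
ci  # .228 to .672
```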
This prediction has two problems. First, the interval is very wide, so it is not very useful. Also, it is centered at .450, which implies that our best guess is that this new player will break Ted Williams’ record. If you follow baseball, this last statement will seem wrong, and this is because you are implicitly using a hierarchical model that factors in information from years of following baseball. Here we show how we can quantify this intuition.
We note that the average player had an AVG of .275 and
the standard deviation of the population of players was 0.027. So we can
see already that .450 would be quite an anomaly since it is over six SDs
away from the mean. So is José lucky or the best batter seen in the last
50 years? Perhaps it’s a combination of both. But how lucky and how good
is he? If we become convinced that he is lucky, we should trade him to a
team that trusts the .450 observation and is maybe overestimating his
potential.
The hierarchical model provides a mathematical description of how we came to see the observation of .450. First, we pick a player at random with an intrinsic ability summarized by, for example, \(\theta\), then we see 20 random outcomes with success probability \(\theta\).
\[ \begin{align*} \theta &\sim N(\mu, \tau^2) \mbox{ describes randomness in picking a player}\\ Y \mid \theta &\sim N(\theta, \sigma^2) \mbox{ describes randomness in the performance of this particular player} \end{align*} \]
Note the two levels (this is why we call them hierarchical): 1) Player to player variability and 2) variability due to luck when batting. In a Bayesian framework, the first level is called a prior distribution and the second the sampling distribution.
Now, let’s use this model for José’s data. Suppose we want to predict his innate ability in the form of his true batting average \(\theta\). This would be the hierarchical model for our data:
\[ \begin{align*} \theta &\sim N(.275, .027^2) \\ Y \mid \theta &\sim N(\theta, .111^2) \end{align*} \]
We now are ready to compute a posterior distribution to summarize our prediction of \(\theta\). The continuous version of Bayes rule can be used here to derive the posterior probability, which is the distribution of the parameter \(\theta\) given the observed data:
\[ \begin{align*} f_{ \theta \mid Y} (\theta\mid Y) &= \frac{f_{Y\mid \theta}(Y\mid \theta) f_{\theta}(\theta) }{f_Y(Y)}\\ &= \frac{f_{Y\mid \theta}(Y\mid \theta) f_{\theta}(\theta)} {\int_{\theta}f_{Y\mid \theta}(Y\mid \theta)f_{\theta}(\theta)} \end{align*} \]
We are particularly interested in the \(\theta\) that maximizes the posterior probability \(f_{\theta\mid Y}(\theta\mid Y)\). In our case, we can show that the posterior is normal and we can compute the mean \(\mbox{E}(\theta\mid y)\) and variance \(\mbox{var}(\theta\mid y)\). Specifically, we can show the average of this distribution is the following:
\[ \begin{align*} \mbox{E}(\theta\mid y) &= B \mu + (1-B) Y\\ &= \mu + (1-B)(Y-\mu)\\ B &= \frac{\sigma^2}{\sigma^2+\tau^2} \end{align*} \]
It is a weighted average of the population average \(\mu\) and the observed data \(Y\). The weight depends on the SD of the population \(\tau\) and the SD of our observed data \(\sigma\). This weighted average is sometimes referred to as shrinking because it shrinks estimates towards a prior mean. In the case of José Iglesias, we have:
\[ \begin{align*} \mbox{E}(\theta \mid Y=.450) &= B \times .275 + (1 - B) \times .450 \\ &= .275 + (1 - B)(.450 - .275) \\ B &=\frac{.111^2}{.111^2 + .027^2} = 0.944\\ \mbox{E}(\theta \mid Y=.450) &\approx .285 \end{align*} \]
The variance can be shown to be:
\[ \mbox{var}(\theta\mid y) = \frac{1}{1/\sigma^2+1/\tau^2} = \frac{1}{1/.111^2 + 1/.027^2} = 0.00069 \] and the standard deviation is therefore \(0.026\). So we started with a frequentist 95% confidence interval that ignored data from other players and summarized just José’s data: .450 \(\pm\) 0.222. Then we used a Bayesian approach that incorporated data from other players and other years to obtain a posterior probability. This is actually referred to as an empirical Bayes approach because we used data to construct the prior. From the posterior we can report what is called a 95% credible interval by reporting a region, centered at the mean, with a 95% chance of occurring. In our case, this turns out to be: .285 \(\pm\) 0.052.
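The posterior mean and SD can be computed directly from the formulas above:

```r
# empirical Bayes posterior for Jose Iglesias's true batting average
mu <- 0.275; tau <- 0.027    # prior: average player and player-to-player SD
Y  <- 0.450; sigma <- 0.111  # observed AVG and its standard error
B <- sigma^2 / (sigma^2 + tau^2)
post_mean <- B * mu + (1 - B) * Y
post_sd   <- sqrt(1 / (1/sigma^2 + 1/tau^2))
round(c(B, post_mean, post_sd), 3)  # 0.944, 0.285, 0.026
```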
The Bayesian credible interval suggests that if another team is impressed by the .450 observation, we should consider trading José as we are predicting he will be just slightly above average. Interestingly, the Red Sox traded José to the Detroit Tigers in July. Here are the José Iglesias batting averages for the next five months.
| Month | At Bat | Hits | AVG |
|---|---|---|---|
| April | 20 | 9 | .450 |
| May | 26 | 11 | .423 |
| June | 86 | 34 | .395 |
| July | 83 | 17 | .205 |
| August | 85 | 25 | .294 |
| September | 50 | 10 | .200 |
| Total w/o April | 330 | 97 | .293 |
Although both intervals included the final batting average, the Bayesian credible interval provided a much more precise prediction. In particular, it predicted that he would not be as good the remainder of the season.
If an experiment is designed incorrectly we may not be able to estimate the parameters of interest. Similarly, when analyzing data we may incorrectly decide to use a model that can’t be fit. If we are using linear models then we can detect these problems mathematically by looking for collinearity in the design matrix.
The following system of equations:
\[ \begin{align*} a+c &=1\\ b-c &=1\\ a+b &=2 \end{align*} \]
has more than one solution since there are an infinite number of triplets that satisfy \(a=1-c, b=1+c\). Two examples are \(a=1,b=1,c=0\) and \(a=0,b=2,c=1\).
The system of equations above can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 2 \end{pmatrix} \]
Note that the third column is a linear combination of the first two:
\[ \, \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} - \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ -1\\ 0 \end{pmatrix} \]
We say that the third column is collinear with the first 2. This implies that the system of equations can be written like this:
\[ \, \begin{pmatrix} 1&0&1\\ 0&1&-1\\ 1&1&0 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = a \begin{pmatrix} 1\\ 0\\ 1 \end{pmatrix} + b \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} + c \begin{pmatrix} 1-0\\ 0-1\\ 1-1 \end{pmatrix} \]
\[ =(a+c) \begin{pmatrix} 1\\ 0\\ 1\\ \end{pmatrix} + (b-c) \begin{pmatrix} 0\\ 1\\ 1\\ \end{pmatrix} \]
The third column does not add a constraint and what we really have are three equations and two unknowns: \(a+c\) and \(b-c\). Once we have values for those two quantities, there are an infinite number of triplets that can be used.
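We can confirm the rank deficiency of this matrix in R:

```r
# the third column equals the first minus the second, so the rank is 2, not 3
A <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, -1, 0))
qr(A)$rank  # 2
```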
Consider a design matrix \(\mathbf{X}\) with two collinear columns. Here we create an extreme example in which one column is the opposite of another:
\[ \mathbf{X} = \begin{pmatrix} \mathbf{1}&\mathbf{X}_1&\mathbf{X}_2&\mathbf{X}_3\\ \end{pmatrix} \mbox{ with, say, } \mathbf{X}_3 = - \mathbf{X}_2 \]
This means that we can rewrite the residuals like this:
\[ \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \mathbf{X}_3\beta_3\right\}\\ = \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 - \mathbf{X}_2\beta_3\right\}\\ = \mathbf{Y}- \left\{\mathbf{1}\beta_0 + \mathbf{X}_1 \beta_1 + \mathbf{X}_2(\beta_2 - \beta_3)\right\} \]
and if \(\hat{\beta}_1\), \(\hat{\beta}_2\), \(\hat{\beta}_3\) is a least squares solution, then, for example, \(\hat{\beta}_1\), \(\hat{\beta}_2+1\), \(\hat{\beta}_3+1\) is also a solution.
Now we will demonstrate how collinearity helps us determine problems with our design using one of the most common errors made in current experimental design: confounding. To illustrate, let’s use an imagined experiment in which we are interested in the effect of four treatments A, B, C and D. We assign two mice to each treatment. After starting the experiment by giving A and B to female mice, we realize there might be a sex effect. We decide to give C and D to males with hopes of estimating this effect. But can we estimate the sex effect? The described design implies the following design matrix:
\[ \, \begin{pmatrix} Sex & A & B & C & D\\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1\\ \end{pmatrix} \]
Here we can see that sex and treatment are confounded. Specifically, the sex column can be written as a linear combination of the C and D columns.
\[ \, \begin{pmatrix} Sex \\ 0\\ 0 \\ 0 \\ 0 \\ 1\\ 1\\ 1 \\ 1 \\ \end{pmatrix} = \begin{pmatrix} C \\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ 0\\ 0\\ \end{pmatrix} + \begin{pmatrix} D \\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 1\\ 1\\ \end{pmatrix} \]
This implies that a unique least squares estimate is not achievable.
The rank of a matrix is the number of columns that
are linearly independent of all the others. If the rank is smaller than the
number of columns, then the LSE are not unique. In R, we can obtain the
rank of a matrix with the function qr, which we will describe
in more detail in a following section.
Sex <- c(0,0,0,0,1,1,1,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 4
Here we will not be able to estimate the effect of sex.
This particular experiment could have been designed better. Using the same number of male and female mice, we can easily design an experiment that allows us to compute the sex effect as well as all the treatment effects. Specifically, when we balance sex and treatments, the confounding is removed as demonstrated by the fact that the rank is now the same as the number of columns:
Sex <- c(0,1,0,1,0,1,0,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
## ncol= 5 rank= 5
For biological annotation, generally sequence- or gene-based, there are three key types of packages;
wherever brackets appear in a package name template, you must substitute an appropriate token. You can survey all annotation packages at the annotation page.
Packages Homo.sapiens, Mus.musculus and Rattus.norvegicus are specialized integrative annotation resources with an evolving interface.
Packages GO.db, KEGG.db, KEGGREST, and reactome.db are primarily intended as organism-independent resources organizing genes into groups. However, there are organism-specific mappings between gene-oriented annotation and these resources that involve specific abbreviations and symbol conventions. These are described when these packages are used.
The standard Linnaean taxonomy is used very generally. So you need to know that
and so on. We use two sorts of abbreviations. For Biostrings-based packages, the contraction of first and second names is used
For NCBI-based annotation maps, we contract further
These packages have four-component names that specify the reference build used
These packages have five-component names that specify the reference build used and the gene catalog
Additional packages that are relevant are
These packages have four component names, with two components fixed. The variable components indicate organism and curating institution.
There are often alternative curating institutions available such as Ensembl.
In this unit, we will show the difference between using the simple
t-test and doing differential expression with the limma
hierarchical model. The reference is Smyth 2004,
listed in the footnotes.
Here we also show the basic steps for performing a limma
analysis. Note that the limma package is very powerful and
has hundreds of pages of documentation which we cannot cover in this
course; however, we recommend that users wanting to explore further
check out this guide.
We start by loading the spike-in data which was introduced in lecture, which has already been normalized.
# BiocManager::install("SpikeInSubset")
library(SpikeInSubset)
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: affy
## Warning: package 'affy' was built under R version 4.3.1
data(rma95)
fac <- factor(rep(1:2,each=3))
We can now perform simple t-tests using the rowttests
function in the genefilter package:
library(genefilter)
rtt <- rowttests(exprs(rma95),fac)
We will define colors depending on whether the p-value is small, the absolute difference in means is large, and whether the feature is a spike-in value.
mask <- with(rtt, abs(dm) < .2 & p.value < .01)
spike <- rownames(rma95) %in% colnames(pData(rma95))
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))
We now plot the results, using the colors defined above. We multiply
the dm by -1, because we are interested in the difference
from the second group to the first (this is the difference used by
lm and the limma package by default). The
spike-in genes are in blue, which mostly have small p-values and large
differences in means. The red points indicate genes which have small
p-values but also small differences in means. We will see how these
points change after using limma.
with(rtt, plot(-dm, -log10(p.value), cex=.8, pch=16,
xlim=c(-1,1), ylim=c(0,5),
xlab="difference in means",
col=cols))
abline(h=2,v=c(-.2,.2), lty=2)
Note that the red genes have mostly low estimates of standard deviation.
rtt$s <- apply(exprs(rma95), 1, function(row) sqrt(.5 * (var(row[1:3]) + var(row[4:6]))))
with(rtt, plot(s, -log10(p.value), cex=.8, pch=16,
log="x",xlab="estimate of standard deviation",
col=cols))
The following three steps perform the basic limma
analysis. We specify coef=2 because we are interested in
the difference between groups, not the intercept.
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
fit <- lmFit(rma95, design=model.matrix(~ fac))
colnames(coef(fit))
## [1] "(Intercept)" "fac2"
fit <- eBayes(fit)
tt <- topTable(fit, coef=2)
tt
## logFC AveExpr t P.Value adj.P.Val B
## 1708_at -7.0610613 7.945276 -73.529269 7.816370e-17 9.868948e-13 8.646866
## 36202_at 0.8525527 9.373033 9.975114 4.935683e-07 3.115897e-03 4.587736
## 36311_at 0.8318298 8.564315 8.363252 3.017008e-06 1.269758e-02 3.567790
## 33264_at 0.7118997 4.918953 7.434888 9.666328e-06 2.706595e-02 2.835849
## 32660_at 0.6554022 8.680132 7.356180 1.071834e-05 2.706595e-02 2.768151
## 38734_at 0.7467142 6.255772 7.185131 1.345115e-05 2.830571e-02 2.617789
## 1024_at 0.8426550 9.697281 6.730664 2.503461e-05 4.400123e-02 2.195944
## 36085_at 0.6449402 12.193130 6.653830 2.787976e-05 4.400123e-02 2.121308
## 33818_at 0.5321749 12.285643 6.454504 3.699480e-05 5.189960e-02 1.923063
## 39058_at 0.6090625 7.534532 6.278815 4.767986e-05 5.687699e-02 1.742696
topTable will return the top genes ranked by whichever
value you define. You can also ask topTable to return all the values,
sorted by "none". Note that a column is automatically
included which gives the adjusted p-values for each gene. By
default the method of Benjamini-Hochberg is used, by calling the
p.adjust function.
# ?topTable
dim(topTable(fit, coef=2, number=Inf, sort.by="none"))
## [1] 12626 6
# ?p.adjust
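To see what p.adjust does on its own, here is a minimal example with made-up p-values:

```r
# Benjamini-Hochberg adjustment of four hypothetical p-values;
# each adjusted value is at least as large as the raw one
p <- c(0.001, 0.01, 0.02, 0.8)
p.adjust(p, method = "BH")
```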
Here we will compare the previous volcano plot with the
limma results. Note that the red points are now all under
the line where -log10(p.value) is equal to 2. Also, the
blue points which represent real differences have p-values which are
even higher than before.
limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=fit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
col=cols,xlab="difference in means",
xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)
Finally, we will construct a plot which shows how limma
shrinks the variance estimates towards a common value, eliminating false
positives which might arise from too-low estimates of variance.
Here we pick, for each of 40 bins of different variance estimates, a single gene which falls in that bin. We remove bins which do not have any such genes.
n <- 40
qs <- seq(from=0,to=.2,length=n)
idx <- sapply(seq_len(n),function(i) which(as.integer(cut(rtt$s^2,qs)) == i)[1])
idx <- idx[!is.na(idx)]
Now we will plot a line, from the initial estimate of variance for
these genes to the estimate after running limma.
par(mar=c(5,5,2,2))
plot(1,1,xlim=c(0,.21),ylim=c(0,1),type="n",
xlab="variance estimates",ylab="",yaxt="n")
axis(2,at=c(.1,.9),c("before","after"),las=2)
segments((rtt$s^2)[idx],rep(.1,n),
fit$s2.post[idx],rep(.9,n))
Smyth GK, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments”. Stat Appl Genet Mol Biol. 2004 http://www.ncbi.nlm.nih.gov/pubmed/16646809
We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank sum test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we perform a t-test on data for which the null is true. However, two observations in one sample are incorrectly entered and become outliers. Here we see that the t-test results in a small p-value, while the Wilcoxon test does not:
set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)
Create outliers:
x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)
## t-test pval: 0.04439948
cat("Wilcox test pval:",wilcox.test(x,y)$p.value)
## Wilcox test pval: 0.1310212
The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.
library(rafalib)
mypar(1,2)
stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)
xrank<-rank(c(x,y))[seq(along=x)]
yrank<-rank(c(x,y))[-seq(along=x)]
stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)
ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)
Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the \(w\) values.
W <-sum(ws)
W is the sum of the ranks for the first group relative
to the second group. We can compute an exact p-value for \(W\) based on combinatorics. We can also use
the CLT since statistical theory tells us that this W is
approximated by the normal distribution. We can construct a z-score as
follows:
n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)
## [1] 1.523124
Here the Z is not large enough to give us a p-value less
than 0.05. These are part of the calculations performed by the R
function wilcox.test.
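For reference, the two-sided p-value implied by this normal approximation is:

```r
# two-sided p-value from the normal approximation to W
Z <- 1.523124
pval_approx <- 2 * (1 - pnorm(Z))
pval_approx  # roughly 0.13, in line with the wilcox.test result above
```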
This chapter introduces the statistical concepts necessary to understand p-values and confidence intervals. These terms are ubiquitous in the life science literature. Let’s use this paper as an example.
Note that the abstract has this statement:
“Body weight was higher in mice fed the high-fat diet already after the first week, due to higher dietary intake in combination with lower metabolic efficiency.”
To support this claim they provide the following in the results section:
“Already during the first week after introduction of high-fat diet, body weight increased significantly more in the high-fat diet-fed mice (\(+\) 1.6 \(\pm\) 0.1 g) than in the normal diet-fed mice (\(+\) 0.2 \(\pm\) 0.1 g; P < 0.001).”
What does P < 0.001 mean? What are the \(\pm\) included? We will learn what this means and learn to compute these values in R. The first step is to understand random variables. To do this, we will use data from a mouse database (provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683). We will import the data into R and explain random variables and null distributions using R programming.
If you already downloaded the femaleMiceWeights file
into your working directory, you can read it into R with just one
line:
dat <- read.csv("femaleMiceWeights.csv")
Remember that a quick way to read the data without downloading it is to use the URL:
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
dat <- read.csv(url)
We are interested in determining if following a given diet makes mice
heavier after several weeks. This data was produced by ordering 24 mice
from The Jackson Lab and randomly assigning either chow or high fat (hf)
diet. After several weeks, the scientists weighed each mouse and
obtained this data (head just shows us the first 6
rows):
head(dat)
## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79
In RStudio, you can view the entire dataset with:
View(dat)
So are the hf mice heavier? Mouse 24 at 20.73 grams is one of the lightest mice, while Mouse 21 at 34.02 grams is one of the heaviest. Both are on the hf diet. Just from looking at the data, we see there is variability. Claims such as the one above usually refer to the averages. So let’s look at the average of each group:
library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )
## [1] 26.83417
print( mean(control) )
## [1] 23.81333
obsdiff <- mean(treatment) - mean(control)
print(obsdiff)
## [1] 3.020833
So the hf diet mice are about 10% heavier. Are we done? Why do we need p-values and confidence intervals? The reason is that these averages are random variables. They can take many values.
If we repeat the experiment, we obtain 24 new mice from The Jackson Laboratory and, after randomly assigning them to each diet, we get a different mean. Every time we repeat this experiment, we get a different value. We call this type of quantity a random variable.
Let’s explore random variables further. Imagine that we actually have the weight of all control female mice and can load them into R. In Statistics, we refer to this as the population. These are all the control mice available from which we sampled 24. Note that in practice we do not have access to the population. We have a special dataset that we are using here to illustrate concepts.
The first step is to download the data from here into your working directory and then read it into R:
population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population)
Now let’s sample 12 mice three times and see how the average changes.
control <- sample(population,12)
mean(control)
## [1] 23.47667
control <- sample(population,12)
mean(control)
## [1] 25.12583
control <- sample(population,12)
mean(control)
## [1] 23.72333
Note how the average varies. We can continue to do this repeatedly and start learning something about the distribution of this random variable.
Now let’s go back to our average difference of obsdiff.
As scientists we need to be skeptics. How do we know that this
obsdiff is due to the diet? What happens if we give all 24
mice the same diet? Will we see a difference this big? Statisticians
refer to this scenario as the null hypothesis. The name “null”
is used to remind us that we are acting as skeptics: we give credence to
the possibility that there is no difference.
Because we have access to the population, we can actually observe as many values as we want of the difference of the averages when the diet has no effect. We can do this by randomly sampling 24 control mice, giving them the same diet, and then recording the difference in mean between two randomly split groups of 12 and 12. Here is this process written in R code:
##12 control mice
control <- sample(population,12)
##another 12 control mice that we act as if they were not
treatment <- sample(population,12)
print(mean(treatment) - mean(control))
## [1] -0.2391667
Now let’s do it 10,000 times. We will use a “for-loop”, an operation
that lets us automate this (a simpler approach, which we will learn
about later, is to use replicate).
n <- 10000
null <- vector("numeric",n)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  null[i] <- mean(treatment) - mean(control)
}
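As an aside, the replicate function mentioned above can generate the same null distribution more compactly. Here is a sketch, assuming population has been loaded as before:

```r
## Equivalent simulation with replicate instead of an explicit for-loop.
## Assumes `population` was read in earlier with read.csv() and unlist().
null <- replicate(10000, {
  control <- sample(population, 12)
  treatment <- sample(population, 12)
  mean(treatment) - mean(control)
})
```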
The values in null form what we call the null
distribution. We will define this more formally below.
So what percent of the 10,000 are bigger than
obsdiff?
mean(null >= obsdiff)
## [1] 0.0123
Only a small percent of the 10,000 simulations exceed it. As skeptics, what do we conclude? When there is no diet effect, we see a difference as big as the one we observed only about 1.2% of the time. This is what is known as a p-value, which we will define more formally later in the book.
We have explained what we mean by null in the context of null hypothesis, but what exactly is a distribution? The simplest way to think of a distribution is as a compact description of many numbers. For example, suppose you have measured the heights of all men in a population. Imagine you need to describe these numbers to someone who has no idea what these heights are, such as an alien that has never visited Earth. Suppose all these heights are contained in the following dataset:
One approach to summarizing these numbers is to simply list them all out for the alien to see. Here are 10 heights randomly selected from the 1,078:
Scanning through these numbers, we start to get a rough idea of what the entire list looks like, but it is certainly inefficient. We can quickly improve on this approach by defining and visualizing a distribution. To define a distribution we compute, for all possible values of \(a\), the proportion of numbers in our list that are below \(a\). We use the following notation:
\[ F(a) \equiv \mbox{Pr}(x \leq a) \]
This is called the cumulative distribution function (CDF). When the CDF is derived from data, as opposed to theoretically, we also call it the empirical CDF (ECDF). The ECDF for the height data looks like this:
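R’s built-in ecdf function computes this directly. As a sketch, using the mouse weight population from earlier in this chapter (the height data themselves are not loaded here):

```r
## ecdf() returns a step function F with F(a) = proportion of values <= a
F <- ecdf(population)
F(25)        ## proportion of weights at or below 25 grams
plot(F, main="ECDF of mouse weights", xlab="a", ylab="F(a)")
```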
Although the empirical CDF concept is widely discussed in statistics textbooks, the plot is actually not very popular in practice. The reason is that histograms give us the same information and are easier to interpret. Histograms show us the proportion of values in intervals:
\[ \mbox{Pr}(a \leq x \leq b) = F(b) - F(a) \]
Plotting these heights as bars is what we call a histogram. It is a more useful plot because we are usually more interested in intervals, for example what percent are between 70 and 71 inches, rather than the percent less than a particular height. It is also easier to distinguish different types (families) of distributions by looking at histograms. Here is a histogram of heights:
We can specify the bins and add better labels in the following way:
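Since the plotting code itself is omitted here, this is a sketch of what such a call might look like, with heights standing in for a hypothetical numeric vector of height measurements:

```r
## `heights` is a hypothetical numeric vector of heights in inches
hist(heights, breaks=seq(floor(min(heights)), ceiling(max(heights))),
     main="Histogram of adult male heights", xlab="Height in inches")
```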
Showing this plot to the alien is much more informative than showing numbers. With this simple plot, we can approximate the number of individuals in any given interval. For example, there are about 70 individuals over six feet (72 inches) tall.
Summarizing lists of numbers is one powerful use of distributions. An even more important use is describing the possible outcomes of a random variable. Unlike a fixed list of numbers, we don’t actually observe all possible outcomes of random variables, so instead of describing proportions, we describe probabilities. For instance, if we pick a random height from our list, then the probability of it falling between \(a\) and \(b\) is denoted with:
\[ \mbox{Pr}(a \leq X \leq b) = F(b) - F(a) \]
Note that the \(X\) is now capitalized to distinguish it as a random variable and that the equation above defines the probability distribution of the random variable. Knowing this distribution is incredibly useful in science. For example, in the case above, if we know the distribution of the difference in mean of mouse weights when the null hypothesis is true, referred to as the null distribution, we can compute the probability of observing a value as large as we did, referred to as a p-value. In a previous section we ran what is called a Monte Carlo simulation (we will provide more details on Monte Carlo simulation in a later section) and we obtained 10,000 outcomes of the random variable under the null hypothesis. Let’s repeat the loop above, but this time let’s add a point to the figure every time we re-run the experiment. If you run this code, you can see the null distribution forming as the observed values stack on top of each other.
n <- 100
library(rafalib)
nullplot(-5,5,1,30, xlab="Observed differences (grams)", ylab="Frequency")
totals <- vector("numeric",11)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  nulldiff <- mean(treatment) - mean(control)
  j <- pmax(pmin(round(nulldiff)+6,11),1)
  totals[j] <- totals[j]+1
  text(j-6, totals[j], round(nulldiff,1))
  ##if(i < 15) Sys.sleep(1) ##You can add this line to see values appear slowly
}
Illustration of the null distribution.
The figure above amounts to a histogram. From a histogram of the
null vector we calculated earlier, we can see that values
as large as obsdiff are relatively rare:
hist(null, freq=TRUE)
abline(v=obsdiff, col="red", lwd=2)
Null distribution with observed difference marked with vertical red line.
An important point to keep in mind here is that while we defined \(F(a)\) by counting cases, we will learn that, in some circumstances, mathematics gives us formulas for \(F(a)\) that save us the trouble of computing them as we did here. One example of this powerful approach uses the normal distribution approximation.
The probability distribution we see above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution. When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of values or outcomes in any given interval:
\[ \mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]
While the formula may look intimidating, don’t worry, you will never
actually have to type it out, as it is stored in a more convenient form
(as pnorm in R which sets a to \(-\infty\), and takes b as an
argument).
Here \(\mu\) and \(\sigma\) are referred to as the mean and
the standard deviation of the population (we explain these in more
detail in another section). If this normal approximation holds
for our list, then the population mean and variance of our list can be
used in the formula above. An example of this would be when we noted
above that only about 1.2% of values on the null distribution were above
obsdiff. We can compute the proportion of values below a
value x with pnorm(x,mu,sigma) without knowing
all the values. The normal approximation works very well here:
1 - pnorm(obsdiff,mean(null),sd(null))
## [1] 0.01311009
Later, we will learn that there is a mathematical explanation for this. A very useful characteristic of this approximation is that one only needs to know \(\mu\) and \(\sigma\) to describe the entire distribution. From this, we can compute the proportion of values in any interval.
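For instance, the proportion of null values expected in any given interval can be obtained from two calls to pnorm. A quick sketch using the null vector computed above:

```r
## Pr(a < x < b) = F(b) - F(a) under the normal approximation,
## using the mean and sd of the null distribution computed above
mu <- mean(null)
sigma <- sd(null)
pnorm(1, mu, sigma) - pnorm(-1, mu, sigma)  ## proportion between -1 and 1 grams
```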
So computing a p-value for the difference in diet for the mice was pretty easy, right? But why are we not done? To make the calculation, we did the equivalent of buying all the mice available from The Jackson Laboratory and performing our experiment repeatedly to define the null distribution. Yet this is not something we can do in practice. Statistical Inference is the mathematical theory that permits you to approximate this with only the data from your sample, i.e. the original 24 mice. We will focus on this in the following sections.
Before we continue, we briefly explain the following important line of code:
set.seed(1)
Throughout this book, we use random number generators. This implies that many of the results presented can actually change by chance, including the correct answer to problems. One way to ensure that results do not change is by setting R’s random number generation seed. For more on the topic please read the help file:
?set.seed
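A quick demonstration of why this matters: after setting the same seed, the same “random” draws are produced:

```r
## Resetting the seed makes random draws reproducible
set.seed(1)
x1 <- sample(1:100, 3)
set.seed(1)
x2 <- sample(1:100, 3)
identical(x1, x2)  ## TRUE: the draws are identical
```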
We have used the example of the effects of two different diets on the weight of mice. Since in this illustrative example we have access to the population, we know that in fact there is a substantial (about 10%) difference between the average weights of the two populations:
library(dplyr)
dat <- read.csv("mice_pheno.csv") #Previously downloaded
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
  select(Bodyweight) %>% unlist
mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517
print((mu_hf - mu_control)/mu_control * 100) #percent increase
## [1] 9.942157
We have also seen that, in some cases, when we take a sample and perform a t-test, we don’t always get a p-value smaller than 0.05. For example, here is a case where we take a sample of 5 mice and don’t achieve statistical significance at the 0.05 level:
set.seed(1)
N <- 5
hf <- sample(hfPopulation,N)
control <- sample(controlPopulation,N)
t.test(hf,control)$p.value
## [1] 0.5806661
Did we make a mistake? By not rejecting the null hypothesis, are we saying the diet has no effect? The answer to this question is no. All we can say is that we did not reject the null hypothesis. But this does not necessarily imply that the null is true. The problem is that, in this particular instance, we don’t have enough power, a term we are now going to define. If you are doing scientific research, it is very likely that you will have to do a power calculation at some point. In many cases, it is an ethical obligation as it can help you avoid sacrificing mice unnecessarily or limiting the number of human subjects exposed to potential risk in a study. Here we explain what statistical power means.
Whenever we perform a statistical test, we are aware that we may make a mistake. This is why our p-values are not 0. Under the null, there is always a positive, perhaps very small, but still positive chance that we will reject the null when it is true. If we use a p-value cutoff of 0.05, this will happen 1 out of 20 times. This error is called a type I error by statisticians.
A type I error is defined as rejecting the null when we should not. This is also referred to as a false positive. So why do we then use 0.05? Shouldn’t we use 0.000001 to be really sure? The reason we don’t use infinitesimal cut-offs to avoid type I errors at all cost is that there is another error we can commit: failing to reject the null when we should. This is called a type II error or a false negative. The R analysis above shows an example of a false negative: we did not reject the null hypothesis (at the 0.05 level) even though, because we have access to the true population means, we know there is in fact a difference. Had we used a p-value cutoff of 0.25, we would not have made this mistake. However, in general, are we comfortable with a type I error rate of 1 in 4? Usually we are not.
Most journals and regulatory agencies frequently insist that results be significant at the 0.01 or 0.05 levels. Of course there is nothing special about these numbers other than the fact that some of the first papers on p-values used these values as examples. Part of the goal of this book is to give readers a good understanding of what p-values and confidence intervals are so that these choices can be judged in an informed way. Unfortunately, in science, these cut-offs are applied somewhat mindlessly, but that topic is part of a complicated debate.
Power is the probability of rejecting the null when the null is
false. Of course “when the null is false” is a complicated statement
because it can be false in many ways. \(\Delta
\equiv \mu_Y - \mu_X\) could be anything and the power actually
depends on this parameter. It also depends on the standard error of your
estimates which in turn depends on the sample size and the population
standard deviations. In practice, we don’t know these so we usually
report power for several plausible values of \(\Delta\), \(\sigma_X\), \(\sigma_Y\) and various sample sizes.
Statistical theory gives us formulas to calculate power. The
pwr package performs these calculations for you. Here we
will illustrate the concepts behind power by coding up simulations in
R.
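For reference, base R also ships power.t.test, which computes power analytically under a normality assumption. The delta and sd below are illustrative values, not estimates taken from the data:

```r
## Analytical power for a two-sample t-test (delta and sd are assumed here)
power.t.test(n=12, delta=2.4, sd=3.5, sig.level=0.05)$power
```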
Suppose our sample size is:
N <- 12
and we will reject the null hypothesis at:
alpha <- 0.05
What is our power with this particular data? We will compute this probability by re-running the exercise many times and calculating the proportion of times the null hypothesis is rejected. Specifically, we will run:
B <- 2000
simulations. The simulation is as follows: we take a sample of size
\(N\) from both control and treatment
groups, we perform a t-test comparing these two, and report if the
p-value is less than alpha or not. We write a function that
does this:
reject <- function(N, alpha=0.05){
  hf <- sample(hfPopulation,N)
  control <- sample(controlPopulation,N)
  pval <- t.test(hf,control)$p.value
  pval < alpha
}
Here is an example of one simulation for a sample size of 12. The
call to reject answers the question “Did we reject?”
reject(12)
## [1] TRUE
Now we can use the replicate function to do this
B times.
rejections <- replicate(B,reject(N))
Our power is just the proportion of times we correctly reject. So with \(N=12\) our power is only:
mean(rejections)
## [1] 0.2155
This explains why the t-test was not rejecting when we knew the null was false. With a sample size of just 12, our power is only about 22%. To guard against false positives at the 0.05 level, we set a threshold stringent enough that it resulted in many type II errors.
Let’s see how power improves with N. We will use the function
sapply, which applies a function to each of the elements of
a vector. We want to repeat the above for the following sample sizes:
Ns <- seq(5, 50, 5)
So we use sapply like this:
power <- sapply(Ns,function(N){
  rejections <- replicate(B, reject(N))
  mean(rejections)
})
For each sample size, the above code returns the proportion of times we reject. Not surprisingly, power increases with N:
plot(Ns, power, type="b")
Power plotted against sample size.
Similarly, if we change the level alpha at which we
reject, power changes. The smaller we want the chance of a type I error
to be, the less power we will have. In other words, we
trade off between the two types of error. We can see this by writing
similar code, but keeping \(N\) fixed
and considering several values of alpha:
N <- 30
alphas <- c(0.1,0.05,0.01,0.001,0.0001)
power <- sapply(alphas,function(alpha){
  rejections <- replicate(B,reject(N,alpha=alpha))
  mean(rejections)
})
plot(alphas, power, xlab="alpha", type="b", log="x")
Power plotted against cut-off.
Note that the x-axis in this last plot is in the log scale.
There is no “right” power or “right” alpha level, but it is important that you understand what each means.
To see this clearly, you could create a plot with curves of power versus N. Show several curves in the same plot with color representing alpha level.
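One way to sketch such a figure, reusing the reject function and B defined above:

```r
## Power versus N, one curve per alpha level (colors distinguish alphas)
Ns <- seq(5, 50, 5)
alphas <- c(0.1, 0.05, 0.01)
plot(range(Ns), c(0, 1), type="n", xlab="N", ylab="power")
for (k in seq_along(alphas)) {
  power <- sapply(Ns, function(N) mean(replicate(B, reject(N, alpha=alphas[k]))))
  lines(Ns, power, type="b", col=k)
}
legend("bottomright", legend=alphas, col=seq_along(alphas), lty=1, title="alpha")
```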
Another consequence of what we have learned about power is that p-values are somewhat arbitrary when the null hypothesis is not true and therefore the alternative hypothesis is true (the difference between the population means is not zero). When the alternative hypothesis is true, we can make a p-value as small as we want simply by increasing the sample size (supposing that we have an infinite population to sample from). We can show this property of p-values by drawing larger and larger samples from our population and calculating p-values. This works because, in our case, we know that the alternative hypothesis is true, since we have access to the populations and can calculate the difference in their means.
First write a function that returns a p-value for a given sample size \(N\):
calculatePvalue <- function(N) {
  hf <- sample(hfPopulation,N)
  control <- sample(controlPopulation,N)
  t.test(hf,control)$p.value
}
We have a limit here of 200 for the high-fat diet population, but we can see the effect well before we get to 200. For each sample size, we will calculate a few p-values. We can do this by repeating each value of \(N\) a few times.
Ns <- seq(10,200,by=10)
Ns_rep <- rep(Ns, each=10)
Again we use sapply to run our simulations:
pvalues <- sapply(Ns_rep, calculatePvalue)
Now we can plot the 10 p-values we generated for each sample size:
plot(Ns_rep, pvalues, log="y", xlab="sample size",
ylab="p-values")
abline(h=c(.01, .05), col="red", lwd=2)
p-values from random samples at varying sample size. The actual value of the p-values decreases as we increase sample size whenever the alternative hypothesis is true.
Note that the y-axis is log scale and that the p-values show a decreasing trend all the way to \(10^{-8}\) as the sample size gets larger. The standard cutoffs of 0.01 and 0.05 are indicated with horizontal red lines.
It is important to remember that p-values are not more interesting as they become very very small. Once we have convinced ourselves to reject the null hypothesis at a threshold we find reasonable, having an even smaller p-value just means that we sampled more mice than was necessary. Having a larger sample size does help to increase the precision of our estimate of the difference \(\Delta\), but the fact that the p-value becomes very very small is just a natural consequence of the mathematics of the test. The p-values get smaller and smaller with increasing sample size because the numerator of the t-statistic has \(\sqrt{N}\) (for equal sized groups, and a similar effect occurs when \(M \neq N\)). Therefore, if \(\Delta\) is non-zero, the t-statistic will increase with \(N\).
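To make the \(\sqrt{N}\) factor concrete, here is a sketch of the two-sample (Welch) t-statistic for equal group sizes, with \(\sqrt{N}\) factored out of the standard error:

```r
## For equal group sizes N: t = sqrt(N) * (mean difference) / sqrt(var1 + var2),
## so a non-zero difference makes |t| grow with sqrt(N)
tstat <- function(hf, control) {
  N <- length(hf)
  sqrt(N) * (mean(hf) - mean(control)) / sqrt(var(hf) + var(control))
}
```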
Therefore, a better statistic to report is the effect size with a confidence interval or some statistic which gives the reader a sense of the change in a meaningful scale. We can report the effect size as a percent by dividing the difference and the confidence interval by the control population mean:
N <- 12
hf <- sample(hfPopulation, N)
control <- sample(controlPopulation, N)
diff <- mean(hf) - mean(control)
diff / mean(control) * 100
## [1] 5.196293
t.test(hf, control)$conf.int / mean(control) * 100
## [1] -7.775307 18.167892
## attr(,"conf.level")
## [1] 0.95
In addition, we can report a statistic called Cohen’s d, which is the difference between the groups divided by the pooled standard deviation of the two groups.
sd_pool <- sqrt(((N-1)*var(hf) + (N-1)*var(control))/(2*N - 2))
diff / sd_pool
## [1] 0.3397886
This tells us how many standard deviations of the data the mean of the high-fat diet group is from the control group. Under the alternative hypothesis, unlike the t-statistic, which is guaranteed to increase with sample size, the estimates of the effect size and Cohen’s d simply become more precise.
In this unit we will show an example of analyzing methylation data. We will use colon cancer data from TCGA. The data were generated with the Illumina 450K array, and we have already processed the raw data to create a matrix of methylation measurements. The script that creates these objects is here: https://github.com/genomicsclass/labs/blob/master/Rscripts/read_tcga_meth.R
Let’s begin by loading the data:
# devtools::install_github("genomicsclass/coloncancermeth")
library(S4Vectors)
library(coloncancermeth)
data(coloncancermeth)
We now have three objects: a matrix containing the methylation data, a table with information about the samples (the columns of the data matrix), and a GRanges object with the genomic location of the CpGs represented in the rows of the data matrix.
dim(meth) ##this is the methylation data
## [1] 485512 26
dim(pd) ##this is sample information
## [1] 26 105
length(gr)
## Loading required package: GenomicRanges
## [1] 485512
The pd object includes clinical information. One of the
columns tells us whether the sample is from colon cancer or from normal
tissue:
colnames(pd)
## [1] "bcr_patient_barcode"
## [2] "bcr_sample_uuid"
## [3] "bcr_sample_barcode"
## [4] "NCNNCT_OthMethONSP"
## [5] "current_weight"
## [6] "days_to_collection"
## [7] "days_to_sample_procurement"
## [8] "freezing_method"
## [9] "initial_weight"
## [10] "intermediate_dimension"
## [11] "longest_dimension"
## [12] "lymphatic_invasion"
## [13] "margins_involved"
## [14] "method_of_sample_procurement"
## [15] "number_regional_lymphnodes_exam"
## [16] "number_regional_lymphnodes_pos"
## [17] "oct_embedded"
## [18] "pathology_report_uuid"
## [19] "primary_or_metastatic_status"
## [20] "sample_type"
## [21] "sample_type_id"
## [22] "shortest_dimension"
## [23] "time_between_clamping_and_freezing"
## [24] "time_between_excision_and_freezing"
## [25] "venous_invasion"
## [26] "verification_by_bcr"
## [27] "vial_number.sample"
## [28] "bcr_patient_barcode.tumor"
## [29] "tumor_necrosis_percent"
## [30] "tumor_nuclei_percent"
## [31] "tumor_weight"
## [32] "vial_number.tumor"
## [33] "bcr_patient_barcode.normal"
## [34] "days_to_normal_sample_procurement"
## [35] "method_of_normal_sample_procurement"
## [36] "normal_control_type"
## [37] "normal_tissue_anatomic_site"
## [38] "normal_tissue_proximity"
## [39] "vial_number"
## [40] "ncedna_dna_conc"
## [41] "ncedna_dna_qm"
## [42] "ncedna_dna_qty"
## [43] "ncedna_dna_vol"
## [44] "patient.age_at_initial_pathologic_diagnosis"
## [45] "patient.ajcc_cancer_staging_handbook_edition"
## [46] "patient.anatomic_organ_subdivision"
## [47] "patient.anatomic_site_colorectal"
## [48] "patient.bcr_patient_uuid"
## [49] "patient.braf_gene_analysis_performed"
## [50] "patient.braf_gene_analysis_result"
## [51] "patient.circumferential_resection_margin"
## [52] "patient.colon_polyps_present"
## [53] "patient.date_of_form_completion"
## [54] "patient.date_of_initial_pathologic_diagnosis"
## [55] "patient.days_to_birth"
## [56] "patient.days_to_death"
## [57] "patient.days_to_initial_pathologic_diagnosis"
## [58] "patient.days_to_last_followup"
## [59] "patient.days_to_last_known_alive"
## [60] "patient.distant_metastasis_pathologic_spread"
## [61] "patient.ethnicity"
## [62] "patient.gender"
## [63] "patient.height"
## [64] "patient.histological_type"
## [65] "patient.history_of_colon_polyps"
## [66] "patient.icd_10"
## [67] "patient.icd_o_3_histology"
## [68] "patient.icd_o_3_site"
## [69] "patient.informed_consent_verified"
## [70] "patient.kras_gene_analysis_performed"
## [71] "patient.kras_mutation_codon"
## [72] "patient.kras_mutation_found"
## [73] "patient.loss_expression_of_mismatch_repair_proteins_by_ihc"
## [74] "patient.loss_expression_of_mismatch_repair_proteins_by_ihc_result"
## [75] "patient.lymph_node_examined_count"
## [76] "patient.lymphatic_invasion"
## [77] "patient.lymphnode_pathologic_spread"
## [78] "patient.microsatellite_instability"
## [79] "patient.non_nodal_tumor_deposits"
## [80] "patient.number_of_abnormal_loci"
## [81] "patient.number_of_first_degree_relatives_with_cancer_diagnosis"
## [82] "patient.number_of_loci_tested"
## [83] "patient.number_of_lymphnodes_positive_by_he"
## [84] "patient.number_of_lymphnodes_positive_by_ihc"
## [85] "patient.patient_id"
## [86] "patient.perineural_invasion_present"
## [87] "patient.person_neoplasm_cancer_status"
## [88] "patient.preoperative_pretreatment_cea_level"
## [89] "patient.pretreatment_history"
## [90] "patient.primary_lymph_node_presentation_assessment"
## [91] "patient.primary_tumor_pathologic_spread"
## [92] "patient.prior_diagnosis"
## [93] "patient.race"
## [94] "patient.residual_tumor"
## [95] "patient.synchronous_colon_cancer_present"
## [96] "patient.tissue_source_site"
## [97] "patient.tumor_stage"
## [98] "patient.tumor_tissue_site"
## [99] "patient.venous_invasion"
## [100] "patient.vital_status"
## [101] "patient.weight"
## [102] "Basename"
## [103] "Status"
## [104] "Tissue"
## [105] "Sex"
table(pd$Status)
##
## normal cancer
## 9 17
normalIndex <- which(pd$Status=="normal")
cancerlIndex <- which(pd$Status=="cancer")
Let’s start by taking a quick look at the distribution of methylation measurements for the normal samples:
i=normalIndex[1]
plot(density(meth[,i],from=0,to=1),main="",ylim=c(0,3),type="n")
for(i in normalIndex){
  lines(density(meth[,i],from=0,to=1),col=1)
}
### Add the cancer samples
for(i in cancerlIndex){
  lines(density(meth[,i],from=0,to=1),col=2)
}
We are interested in finding regions of the genome that are different between cancer and normal samples. Furthermore, we want regions that are consistently different; we can therefore treat this as an inference problem. We can compute a t-statistic for each CpG:
library(limma)
X<-model.matrix(~pd$Status)
fit<-lmFit(meth,X)
eb <- eBayes(fit)
A volcano plot reveals many differences:
library(rafalib)
splot(fit$coef[,2],-log10(eb$p.value[,2]),xlab="Effect size",ylab="-log10 p-value")
If we have reason to believe that, for DNA methylation to have an effect on gene expression, a region of the genome (not just a single CpG) needs to be affected, then we should look beyond single points. Here is a plot of the region surrounding the top hit:
library(GenomicRanges)
i <- which.min(eb$p.value[,2])
middle <- gr[i,]
Index<-gr%over%(middle+10000)
cols=ifelse(pd$Status=="normal",1,2)
chr=as.factor(seqnames(gr))
pos=start(gr)
plot(pos[Index],fit$coef[Index,2],type="b",xlab="genomic location",ylab="difference")
matplot(pos[Index],meth[Index,],col=cols,xlab="genomic location")
We can search for these regions explicitly instead of searching for single points, as explained by Jaffe and Irizarry (2012) [http://www.ncbi.nlm.nih.gov/pubmed/22422453].
If we are going to perform a regional analysis, we first have to define a region. One issue is that not only do we have to separate the analysis by chromosome, but also that within each chromosome we usually have big gaps, creating subgroups of regions to be analyzed.
chr1Index <- which(chr=="chr1")
hist(log10(diff(pos[chr1Index])),main="",xlab="log10 distance between consecutive CpGs")
We can create groups in the following way.
# BiocManager::install("bumphunter")
library(bumphunter)
cl=clusterMaker(chr,pos,maxGap=500)
table(table(cl)) ##shows the number of regions with 1,2,3, ... points in them
##
## 1 2 3 4 5 6 7 8 9 10 11
## 141457 18071 13227 6473 5144 3748 2517 2135 2029 1878 1792
## 12 13 14 15 16 17 18 19 20 21 22
## 1570 1269 933 684 472 337 240 181 99 113 62
## 23 24 25 26 27 28 29 30 31 32 33
## 57 42 36 39 26 17 21 19 12 12 12
## 34 35 36 37 38 39 40 41 42 43 44
## 7 7 12 3 9 4 5 7 7 9 7
## 45 46 48 49 50 51 52 53 54 55 56
## 7 8 3 5 5 2 1 5 2 4 1
## 57 58 59 60 61 62 63 64 65 67 68
## 1 3 2 1 4 3 4 1 2 2 1
## 70 71 73 74 76 78 80 82 83 85 87
## 2 1 2 2 2 1 3 2 1 2 1
## 88 89 90 91 92 93 112 117 137 141 181
## 2 1 1 1 1 1 2 1 1 1 1
Now let’s consider two example regions:
###Select the region with the smallest value
Index<- which(cl==cl[which.min(fit$coef[,2])])
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")
x1=pos[Index]
y1=fit$coef[Index,2]
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
This region shows only a single CpG as different. In contrast, notice this region:
Index=which(cl==72201) ##we know this is a good example from analysis we have already performed
matplot(pos[Index],meth[Index,],col=cols,pch=1,xlab="genomic location",ylab="methylation")
x2=pos[Index]
y2=fit$coef[Index,2]
plot(x2,y2,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=0,lty=2)
abline(h=c(-.1,.1),lty=2)
If we are interested in prioritizing regions over single points, we need an alternative approach. If we assume that the real signal is smooth, we can use statistical smoothing techniques such as loess. Here is an example using the two regions above:
lfit <- loess(y1~x1,degree=1,family="symmetric",span=1/2)
plot(x1,y1,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-.1,0,.1),lty=2)
lines(x1,lfit$fitted,col=2)
lfit <- loess(y2~x2,degree=1,family="symmetric",span=1/2)
plot(x2,y2,xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-.1,0,.1),lty=2)
lines(x2,lfit$fitted,col=2)
The bumphunter function automates this procedure:
res<-bumphunter(meth,X,chr=chr,pos=pos,cluster=cl,cutoff=0.1,B=0)
## [bumphunterEngine] Using a single core (backend: doSEQ, version: 1.5.2).
## [bumphunterEngine] Computing coefficients.
## [bumphunterEngine] Finding regions.
## [bumphunterEngine] Found 68682 bumps.
tab<-res$table
We now have a list of regions instead of single points. Here we look at the region with the highest rank if we order by area:
Index <- (tab[1,7]-3):(tab[1,8]+3) ##columns 7 and 8 hold the start and end row indices of the region
matplot(pos[Index],meth[Index,,drop=TRUE],col=cols,pch=1,xlab="genomic location",ylab="Methylation",ylim=c(0,1))
plot(pos[Index],res$fitted[Index,1],xlab="genomic location",ylab="Methylation difference",ylim=c(-1,1))
abline(h=c(-0.1,0,.1),lty=2)
The function also allows for smoothing and permutation-based inference for the regions. However, we do not recommend running the function with these options without the ability to parallelize.
First we load an example data frame:
rats <- data.frame(id = paste0("rat",1:10),
                   sex = factor(rep(c("female","male"),each=5)),
                   weight = c(2,4,1,11,18,12,7,12,19,20),
                   length = c(100,105,115,130,95,150,165,180,190,175))
rats
## id sex weight length
## 1 rat1 female 2 100
## 2 rat2 female 4 105
## 3 rat3 female 1 115
## 4 rat4 female 11 130
## 5 rat5 female 18 95
## 6 rat6 male 12 150
## 7 rat7 male 7 165
## 8 rat8 male 12 180
## 9 rat9 male 19 190
## 10 rat10 male 20 175
The summary and str functions are two
helpful functions for getting a sense of data. summary
works on vectors or matrix-like objects (including data.frames).
str works on an arbitrary R object and will compactly
display the structure.
summary(rats)
## id sex weight length
## Length:10 female:5 Min. : 1.00 Min. : 95.0
## Class :character male :5 1st Qu.: 4.75 1st Qu.:107.5
## Mode :character Median :11.50 Median :140.0
## Mean :10.60 Mean :140.5
## 3rd Qu.:16.50 3rd Qu.:172.5
## Max. :20.00 Max. :190.0
summary(rats$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.75 11.50 10.60 16.50 20.00
str(rats)
## 'data.frame': 10 obs. of 4 variables:
## $ id : chr "rat1" "rat2" "rat3" "rat4" ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 1 2 2 2 2 2
## $ weight: num 2 4 1 11 18 12 7 12 19 20
## $ length: num 100 105 115 130 95 150 165 180 190 175
We load another example data frame, with the original ID and another secret ID. Suppose we want to sort the original data frame by the secret ID.
ratsTable <- data.frame(id = paste0("rat",c(6,9,7,3,5,1,10,4,8,2)),
                        secretID = 1:10)
ratsTable
## id secretID
## 1 rat6 1
## 2 rat9 2
## 3 rat7 3
## 4 rat3 4
## 5 rat5 5
## 6 rat1 6
## 7 rat10 7
## 8 rat4 8
## 9 rat8 9
## 10 rat2 10
# wrong!
cbind(rats, ratsTable)
## id sex weight length id secretID
## 1 rat1 female 2 100 rat6 1
## 2 rat2 female 4 105 rat9 2
## 3 rat3 female 1 115 rat7 3
## 4 rat4 female 11 130 rat3 4
## 5 rat5 female 18 95 rat5 5
## 6 rat6 male 12 150 rat1 6
## 7 rat7 male 7 165 rat10 7
## 8 rat8 male 12 180 rat4 8
## 9 rat9 male 19 190 rat8 9
## 10 rat10 male 20 175 rat2 10
match is a very useful function in R. It can give us
this order, but it’s also easy to get its arguments mixed up. Remember
that match gives you, for each element in the first vector,
the index of the first match in the second vector. So typically the
data.frame or vector you are reordering would appear as the second
argument to match. It’s always a good idea to check that
you got it right, which you can do by using cbind to line
up both data frames.
match(ratsTable$id, rats$id)
## [1] 6 9 7 3 5 1 10 4 8 2
rats[match(ratsTable$id, rats$id),]
## id sex weight length
## 6 rat6 male 12 150
## 9 rat9 male 19 190
## 7 rat7 male 7 165
## 3 rat3 female 1 115
## 5 rat5 female 18 95
## 1 rat1 female 2 100
## 10 rat10 male 20 175
## 4 rat4 female 11 130
## 8 rat8 male 12 180
## 2 rat2 female 4 105
cbind(rats[match(ratsTable$id, rats$id),], ratsTable)
## id sex weight length id secretID
## 6 rat6 male 12 150 rat6 1
## 9 rat9 male 19 190 rat9 2
## 7 rat7 male 7 165 rat7 3
## 3 rat3 female 1 115 rat3 4
## 5 rat5 female 18 95 rat5 5
## 1 rat1 female 2 100 rat1 6
## 10 rat10 male 20 175 rat10 7
## 4 rat4 female 11 130 rat4 8
## 8 rat8 male 12 180 rat8 9
## 2 rat2 female 4 105 rat2 10
Or you can use the merge function, which will handle
everything for you. You can tell it the names of the columns to merge
on, or it will look for columns with the same name.
ratsMerged <- merge(rats, ratsTable, by.x="id", by.y="id")
ratsMerged[order(ratsMerged$secretID),]
## id sex weight length secretID
## 7 rat6 male 12 150 1
## 10 rat9 male 19 190 2
## 8 rat7 male 7 165 3
## 4 rat3 female 1 115 4
## 6 rat5 female 18 95 5
## 1 rat1 female 2 100 6
## 2 rat10 male 20 175 7
## 5 rat4 female 11 130 8
## 9 rat8 male 12 180 9
## 3 rat2 female 4 105 10
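Since both data frames share the column name id, the by.x and by.y arguments can in fact be omitted; a small self-contained sketch, with the data frames reconstructed from the values shown above:

```r
# Reconstruct the two data frames from the values shown above
rats <- data.frame(id = paste0("rat", 1:10),
                   sex = factor(rep(c("female", "male"), each = 5)),
                   weight = c(2, 4, 1, 11, 18, 12, 7, 12, 19, 20),
                   length = c(100, 105, 115, 130, 95, 150, 165, 180, 190, 175))
ratsTable <- data.frame(id = paste0("rat", c(6, 9, 7, 3, 5, 1, 10, 4, 8, 2)),
                        secretID = 1:10)
# merge() looks for shared column names when "by" is not given
ratsMerged <- merge(rats, ratsTable)
ratsMerged[order(ratsMerged$secretID), ]
```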
Suppose we need to calculate the average rat weight for each sex. We
could start by splitting the weight vector into a list of weight vectors
divided by sex. split is a useful function for breaking up
a vector into groups defined by a second vector, typically a factor. We
can then use the lapply function to calculate the average
of each element of the list, which are vectors of weights.
sp <- split(rats$weight, rats$sex)
sp
## $female
## [1] 2 4 1 11 18
##
## $male
## [1] 12 7 12 19 20
lapply(sp, mean)
## $female
## [1] 7.2
##
## $male
## [1] 14
A shortcut for this is to use tapply and give the
function, which should run on each element of the list, as a third
argument:
tapply(rats$weight, rats$sex, mean)
## female male
## 7.2 14.0
R is constantly being developed in the form of add-on packages, which
can sometimes greatly simplify basic analysis tasks. A newer package,
dplyr, can accomplish the same task as above, and can be extended to
many other, more complicated operations. The “d” in the name stands for
data.frame, and the “ply” is there because the package aims to simplify
the tasks typically done with the apply family of functions: sapply,
lapply, tapply, etc. Here is the same task as
before done with the dplyr functions group_by and
summarise:
library(dplyr)
sexes <- group_by(rats, sex)
summarise(sexes, ave=mean(weight))
## # A tibble: 2 × 2
## sex ave
## <fct> <dbl>
## 1 female 7.2
## 2 male 14
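dplyr operations are commonly chained with the pipe operator %>%; here is the same computation sketched that way, with the relevant columns of the rats data reconstructed from the values shown above:

```r
library(dplyr)
# Reconstruct the relevant columns of the rats data frame
rats <- data.frame(sex = factor(rep(c("female", "male"), each = 5)),
                   weight = c(2, 4, 1, 11, 18, 12, 7, 12, 19, 20))
# The pipe passes the result of each step as the first argument of the next
res <- rats %>%
  group_by(sex) %>%
  summarise(ave = mean(weight))
res
```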
One of the most useful applications of projections relates to coordinate rotations. In data analysis, simple rotations can make data easier to visualize and interpret. We will describe the mathematics behind rotations and give some data analysis examples.
In our previous section, we used the following example:
\[ Y = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2 \begin{pmatrix} 1\\ 0 \end{pmatrix} + 3 \begin{pmatrix} 0\\ 1 \end{pmatrix} \]
and noted that \(2\) and \(3\) are the coordinates.
library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",
type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
arrows(0,0,2,3,lwd=3)
segments(2,0,2,3,lty=2)
segments(0,3,2,3,lty=2)
text(2,3," Y",pos=4,cex=3)
Plot of (2,3) as coordinates along Dimension 1 (1,0) and Dimension 2 (0,1).
However, mathematically we can represent the point \((2,3)\) with other linear combinations:
\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &= 2.5 \begin{pmatrix} 1\\ 1\end{pmatrix} + -1 \begin{pmatrix} \phantom{-}0.5\\ -0.5\end{pmatrix} \end{align*}\]
The new coordinates are:
\[Z = \begin{pmatrix} 2.5 \\ -1 \end{pmatrix}\]
Graphically, we can see that the coordinates are the projections to the spaces defined by the new basis:
library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",
type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
abline(0,1,col="red")
abline(0,-1,col="red")
arrows(0,0,2,3,lwd=3)
y=c(2,3)
x1=c(1,1)##new basis
x2=c(0.5,-0.5)##new basis
c1 = crossprod(x1,y)/crossprod(x1)
c2 = crossprod(x2,y)/crossprod(x2)
segments(x1[1]*c1,x1[2]*c1,y[1],y[2],lty=2)
segments(x2[1]*c2,x2[2]*c2,y[1],y[2],lty=2)
text(2,3," Y",pos=4,cex=3)
Plot of (2,3) as a vector in a rotated space, relative to the original dimensions.
We can go back and forth between these two representations of \((2,3)\) using matrix multiplication.
\[ Y = AZ \]
\[ A^{-1} Y = Z \]
\[ A= \begin{pmatrix} 1& \phantom{-}0.5\\ 1 & -0.5\end{pmatrix} \implies A^{-1}= \begin{pmatrix} 0.5& 0.5 \\ 1 &-1\end{pmatrix} \]
\(Z\) and \(Y\) carry the same information, but in a different coordinate system.
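We can verify this back-and-forth numerically; a quick sketch in R:

```r
# The columns of A are the new basis vectors (1,1) and (0.5,-0.5)
A <- matrix(c(1, 1, 0.5, -0.5), nrow = 2)
y <- c(2, 3)
z <- solve(A) %*% y  # coordinates in the new basis: 2.5 and -1
A %*% z              # recovers the original point (2, 3)
```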
Here are 100 two-dimensional points \(Y\):
Twin 2 heights versus twin 1 heights.
Here are the rotations: \(Z = A^{-1} Y\)
Rotation of twin 2 heights versus twin 1 heights.
What we have done here is rotate the data so that the first coordinate of \(Z\) is the average height, while the second is the difference between twin heights.
We have used the singular value decomposition to find principal components. It is sometimes useful to think of the SVD as a rotation, for example \(\mathbf{U}^\top \mathbf{Y}\), that gives us a new coordinate system \(\mathbf{DV}^\top\) in which the dimensions are ordered by how much variance they explain.
Now that we have introduced the idea of a random variable, a null distribution, and a p-value, we are ready to describe the mathematical theory that permits us to compute p-values in practice. We will also learn about confidence intervals and power calculations.
A first step in statistical inference is to understand what population you are interested in. In the mouse weight example, we have two populations: female mice on control diets and female mice on high fat diets, with weight being the outcome of interest. We consider this population to be fixed, and the randomness comes from the sampling. One reason we have been using this dataset as an example is because we happen to have the weights of all the mice of this type. We download this file to our working directory and read it into R:
dat <- read.csv("mice_pheno.csv")
We can then access the population values and determine, for example, how many we have. Here we compute the size of the control population:
We usually denote these values as \(x_1,\dots,x_m\). In this case, \(m\) is the number computed above. We can do the same for the high fat diet population:
and denote with \(y_1,\dots,y_n\).
We can then define summaries of interest for these populations, such as the mean and variance.
The mean:
\[\mu_X = \frac{1}{m}\sum_{i=1}^m x_i \mbox{ and } \mu_Y = \frac{1}{n} \sum_{i=1}^n y_i\]
The variance:
\[\sigma_X^2 = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_X)^2 \mbox{ and } \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\mu_Y)^2\]
with the standard deviation being the square root of the variance. We refer to such quantities that can be obtained from the population as population parameters. The question we started out asking can now be written mathematically: is \(\mu_Y - \mu_X = 0\) ?
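The population summaries above translate directly into R. Here is a sketch using a toy stand-in for mice_pheno.csv; the column names Sex, Diet and Bodyweight are assumptions based on this dataset's usual layout:

```r
# Toy stand-in for mice_pheno.csv; column names are an assumption
dat <- data.frame(Sex  = c("F", "F", "F", "F", "M", "M"),
                  Diet = c("chow", "chow", "hf", "hf", "chow", "hf"),
                  Bodyweight = c(21, 24, 28, 30, 30, 35))
x <- dat$Bodyweight[dat$Sex == "F" & dat$Diet == "chow"]  # control population
y <- dat$Bodyweight[dat$Sex == "F" & dat$Diet == "hf"]    # high fat population
mu_x <- mean(x); mu_y <- mean(y)
# population variance divides by m, not m - 1, matching the formulas above
sigma2_x <- mean((x - mu_x)^2)
sigma2_y <- mean((y - mu_y)^2)
mu_y - mu_x  # the population difference we are asking about
```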
Although in our illustration we have all the values and can check if this is true, in practice we do not. For example, in practice it would be prohibitively expensive to buy all the mice in a population. Here we learn how taking a sample permits us to answer our questions. This is the essence of statistical inference.
In the previous chapter, we obtained samples of 12 mice from each population. We represent data from samples with capital letters to indicate that they are random. This is common practice in statistics, although it is not always followed. So the samples are \(X_1,\dots,X_M\) and \(Y_1,\dots,Y_N\) and, in this case, \(N=M=12\). In contrast and as we saw above, when we list out the values of the population, which are set and not random, we use lower-case letters.
Since we want to know if \(\mu_Y - \mu_X\) is 0, we consider the sample version: \(\bar{Y}-\bar{X}\) with:
\[ \bar{X}=\frac{1}{M} \sum_{i=1}^M X_i \mbox{ and }\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i. \]
Note that this difference of averages is also a random variable. Previously, we learned about the behavior of random variables with an exercise that involved repeatedly sampling from the original distribution. Of course, this is not an exercise that we can execute in practice. In this particular case it would involve buying 24 mice over and over again. Here we describe the mathematical theory that relates \(\bar{X}\) to \(\mu_X\) and \(\bar{Y}\) to \(\mu_Y\), which will in turn help us understand the relationship between \(\bar{Y}-\bar{X}\) and \(\mu_Y - \mu_X\). Specifically, we will describe how the Central Limit Theorem permits us to use an approximation to answer this question, as well as motivate the widely used t-distribution.
Here we show the exploratory plots offered by the cummeRbund package. These plots require a directory in which the results of a Cufflinks analysis have been stored. Follow the vignette in the above link in order to perform a Cufflinks gene- and isoform-level analysis. From the vignette:
CummeRbund begins by re-organizing output files of a cuffdiff analysis, and storing these data in a local SQLite database. CummeRbund indexes the data to speed up access to specific feature data (genes, isoforms, TSS, CDS, etc.), and preserves the various relationships between these features.
library(cummeRbund)
## Warning: package 'rtracklayer' was built under R version 4.3.1
## Warning: package 'Gviz' was built under R version 4.3.1
myDir <- system.file("extdata", package="cummeRbund")
gtfFile <- system.file("extdata/chr1_snippet.gtf",package="cummeRbund")
Read in the prepared Cufflinks files from the directory:
cuff <- readCufflinks(dir=myDir,gtfFile=gtfFile,genome="hg19",rebuild=TRUE)
Boxplots of expression (FPKM) at the gene and isoform level:
csBoxplot(genes(cuff))
csBoxplot(genes(cuff),replicates=TRUE)
csBoxplot(isoforms(cuff),replicates=TRUE)
Scatterplot matrix of gene and isoform level expression:
csScatterMatrix(genes(cuff))
## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the cummeRbund package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
csScatterMatrix(isoforms(cuff))
Sample dendrograms using Jensen-Shannon distances:
csDendro(genes(cuff),replicates=TRUE)
## 'dendrogram' with 2 branches and 6 members total, at height 0.2685017
csDendro(isoforms(cuff),replicates=TRUE)
## 'dendrogram' with 2 branches and 6 members total, at height 0.4377249
MA-plot comparing two conditions:
MAplot(genes(cuff),"hESC","Fibroblasts")
## Warning: Removed 54 rows containing missing values (`geom_point()`).
MAplot(isoforms(cuff),"hESC","Fibroblasts")
## Warning: Removed 187 rows containing missing values (`geom_point()`).
A “volcano plot” matrix. Each volcano plot is the -log10(p-value) over the log fold change.
csVolcanoMatrix(genes(cuff))
csVolcanoMatrix(isoforms(cuff))
For all of these functions, see the help pages in the cummeRbund package for more details, and check the vignette for a sample workflow. The Cufflinks homepage has details about running the pipeline upstream of producing these figures.
browseVignettes("cummeRbund")
This section is based on a talk by Karl W. Broman titled “How to Display Data Badly,” in which he described how the default plots offered by Microsoft Excel “obscure your data and annoy your readers” (here is a link to a collection of Karl Broman’s talks). His lecture was inspired by the 1984 paper by H. Wainer: How to display data badly. American Statistician 38(2): 137–147. Dr. Wainer was the first to elucidate the principles of the bad display of data. However, according to Karl Broman, “The now widespread use of Microsoft Excel has resulted in remarkable advances in the field.” Here we show examples of “bad plots” and how to improve them in R.
The aim of good data graphics is to display data accurately and clearly. According to Karl Broman, some rules for displaying data badly are:
Let’s say we want to report the results from a poll asking about browser preference (taken in August 2013). The standard way of displaying these is with a pie chart:
pie(browsers,main="Browser Usage (August 2013)")
Pie chart of browser usage.
Nonetheless, as stated by the help file for the pie
function:
“Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”
To see this, look at the figure above and try to determine the percentages just from looking at the plot. Unless the percentages are close to 25%, 50% or 75%, this is not so easy. Simply showing the numbers is not only clear, but also saves on printing costs.
browsers
## Opera Safari Firefox IE Chrome
## 1 9 20 26 44
If you do want to plot them, then a barplot is appropriate. Here we add horizontal lines at every multiple of 10 and then redraw the bars:
barplot(browsers, main="Browser Usage (August 2013)", ylim=c(0,55))
abline(h=1:5 * 10)
barplot(browsers, add=TRUE)
Barplot of browser usage.
Notice that we can now pretty easily determine the percentages by following a horizontal line to the x-axis. Do avoid a 3D version since it obfuscates the plot, making it more difficult to find the percentages by eye.
Even worse than pie charts are donut plots.
The reason is that by removing the center, we remove one of the visual cues for determining the different areas: the angles. There is no reason to ever use a donut plot to display data.
While barplots are useful for showing percentages, they are incorrectly used to display data from two groups being compared. Specifically, barplots are created with height equal to the group means; an antenna is added at the top to represent standard errors. This plot is simply showing two numbers per group and the plot adds nothing:
Much more informative is to summarize with a boxplot. If the number
of points is small enough, we might as well add them to the plot. When
the number of points is too large for us to see them, just showing a
boxplot is preferable. We can even set range=0 in
boxplot to avoid drawing many outliers when the data is in
the range of millions.
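As an aside, the range=0 option mentioned above makes the whiskers extend to the data extremes, so no individual outlier points are drawn; a sketch with simulated heavy-tailed data:

```r
set.seed(1)
x <- c(rnorm(1000), rexp(20, rate = 0.01))  # mostly normal, plus large outliers
# range = 0 extends the whiskers to the extremes instead of drawing outliers
boxplot(x, range = 0, ylab = "Response")
```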
Let’s recreate these barplots as boxplots. First let’s download the data:
library(downloader)
filename <- "fig1.RData"
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig1.RData"
if (!file.exists(filename)) download(url,filename)
load(filename)
Now we can simply show the points and make simple boxplots:
library(rafalib)
mypar()
dat <- list(Treatment=x,Control=y)
boxplot(dat,xlab="Group",ylab="Response",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)
Treatment data and control data shown with a boxplot.
Notice how much more we see here: the center, spread, range, and the points themselves. In the barplot, we only see the mean and the SE, and the SE has more to do with sample size than with the spread of the data.
This problem is magnified when our data has outliers or very large tails. In the plot below, there appears to be very large and consistent differences between the two groups:
However, a quick look at the data demonstrates that this difference is mostly driven by just two points. A version showing the data in the log-scale is much more informative.
Start by downloading data:
library(downloader)
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig3.RData"
filename <- "fig3.RData"
if (!file.exists(filename)) download(url, filename)
load(filename)
Now we can show data and boxplots in original scale and log-scale.
library(rafalib)
mypar(1,2)
dat <- list(Treatment=x,Control=y)
boxplot(dat,xlab="Group",ylab="Response",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)
boxplot(dat,xlab="Group",ylab="Response",log="y",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)
Data and boxplots for original data (left) and in log scale (right).
The purpose of many statistical analyses is to determine relationships between two variables. Sample correlations are typically reported and sometimes plots are displayed to show this. However, showing just the regression line is one way to display your data badly since it hides the scatter. Surprisingly, plots such as the following are commonly seen.
Again start by loading data:
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig4.RData"
filename <- "fig4.RData"
if (!file.exists(filename)) download(url, filename)
load(filename)
mypar(1,2)
plot(x,y,lwd=2,type="n")
fit <- lm(y~x)
abline(fit$coef,lwd=2)
b <- round(fit$coef,4)
text(78, 200, paste("y =", b[1], "+", b[2], "x"), adj=c(0,0.5))
rho <- round(cor(x,y),4)
text(78, 187,expression(paste(rho," = 0.8567")),adj=c(0,0.5))
plot(x,y,lwd=2)
fit <- lm(y~x)
abline(fit$coef,lwd=2)
The plot on the left shows a regression line that was fitted to the data shown on the right. It is much more informative to show all the data.
When there are large amounts of points, the scatter can be shown by
binning in two dimensions and coloring the bins by the number of points
in the bin. An example of this is the hexbin function in
the hexbin
package.
When new technologies or laboratory techniques are introduced, we are often shown scatter plots and correlations from replicated samples. High correlations are used to demonstrate that the new technique is reproducible. Correlation, however, can be very misleading. Below is a scatter plot showing data from replicated samples run on a high throughput technology. This technology outputs 12,626 simultaneous measurements.
In the plot on the left, we see the original data which shows very high correlation. Yet the data follows a distribution with very fat tails. Furthermore, 95% of the data is below the green line. The plot on the right is in the log scale:
Gene expression data from two replicated samples. Left is in original scale and right is in log scale.
Note that we do not show the code here as it is rather complex, but we explain how to make MA plots in a later chapter.
Although the correlation is reduced in the log-scale, it is very close to 1 in both cases. Does this mean these data are reproducible? To examine how well the second vector reproduces the first, we need to study the differences. We should therefore plot the differences instead. In this plot, we show the difference (in the log scale) versus the average:
MA plot of the same data shown above shows that data is not replicated very well despite a high correlation.
These are referred to as Bland-Altman plots, or MA plots in the genomics literature, and we will talk more about them later. “MA” stands for “minus” and “average” because in this plot, the y-axis is the difference between two samples on the log scale (the log ratio is the difference of the logs), and the x-axis is the average of the samples on the log scale. In this plot, we see that the typical difference in the log (base 2) scale between two replicated measures is about 1. This means that when measurements should be the same, we will, on average, observe a 2-fold difference. We can now compare this variability to the differences we want to detect and decide if this technology is precise enough for our purposes.
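An MA plot is straightforward to build from two replicate vectors; a minimal sketch with simulated data (the simulation itself is an assumption, standing in for real replicate measurements):

```r
# A minimal MA-plot sketch for two replicate measurement vectors (simulated)
set.seed(1)
x <- 2^rnorm(1000, mean = 8, sd = 2)        # replicate 1, roughly log-normal
y <- x * 2^rnorm(1000, mean = 0, sd = 0.5)  # replicate 2: replicate 1 plus log-scale noise
M <- log2(y) - log2(x)        # "minus": difference on the log scale
A <- (log2(x) + log2(y)) / 2  # "average" on the log scale
plot(A, M, xlab = "Average (log2)", ylab = "Difference (log2)")
abline(h = 0, col = 2)
```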
A common task in data analysis is the comparison of two groups. When the dataset is small and data are paired, such as the outcomes before and after a treatment, two-color barplots are unfortunately often used to display the results.
There are better ways of showing these data to illustrate that there is an increase after treatment. One is to simply make a scatter plot, which shows that most points are above the identity line. Another alternative is to plot the differences against the before values.
set.seed(12201970)
before <- runif(6, 5, 8)
after <- rnorm(6, before*1.05, 2)
li <- range(c(before, after))
ymx <- max(abs(after-before))
mypar(1,2)
plot(before, after, xlab="Before", ylab="After",
ylim=li, xlim=li)
abline(0,1, lty=2, col=1)
plot(before, after-before, xlab="Before", ylim=c(-ymx, ymx),
ylab="Change (After - Before)", lwd=2)
abline(h=0, lty=2, col=1)
For two variables, a scatter plot or a ‘rotated’ plot similar to an MA plot is much more informative.
Line plots are not a bad choice, although I find them harder to follow than the previous two. Boxplots show you the increase, but lose the paired information.
z <- rep(c(0,1), rep(6,2))
mypar(1,2)
plot(z, c(before, after),
xaxt="n", ylab="Response",
xlab="", xlim=c(-0.5, 1.5))
axis(side=1, at=c(0,1), c("Before","After"))
segments(rep(0,6), before, rep(1,6), after, col=1)
boxplot(before,after,names=c("Before","After"),ylab="Response")
Another alternative is a line plot. If we don’t care about pairings, then the boxplot is appropriate.
The figure below shows three curves. Pseudo 3D is used, but it is not clear why. Maybe to separate the three curves? Notice how difficult it is to determine the values of the curves at any given point:
This plot can be made better by simply using color to distinguish the three lines:
##First read data
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig8dat.csv"
x <- read.csv(url)
##Now make alternative plot
plot(x[,1],x[,2],xlab="log Dose",ylab="Proportion survived",ylim=c(0,1),
type="l",lwd=2,col=1)
lines(x[,1],x[,3],lwd=2,col=2)
lines(x[,1],x[,4],lwd=2,col=3)
legend(1,0.4,c("Drug A","Drug B","Drug C"),lwd=2, col=1:3)
This plot demonstrates that using color is more than enough to distinguish the three lines.
In this example, we generate data with a simulation. We are studying a dose-response relationship between two groups: treatment and control. We have three groups of measurements for both control and treatment. Below, treatment and control are compared using the common barplot.
Instead, we should show each curve. We can use color to distinguish treatment and control, and dashed and solid lines to distinguish the original data from the mean of the three groups.
plot(x, y1, ylim=c(0,1), type="n", xlab="Dose", ylab="Response")
for(i in 1:3) lines(x, y[,i], col=1, lwd=1, lty=2)
for(i in 1:3) lines(x, z[,i], col=2, lwd=1, lty=2)
lines(x, ym, col=1, lwd=2)
lines(x, zm, col=2, lwd=2)
legend("bottomleft", lwd=2, col=c(1, 2), c("Control", "Treated"))
Because dose is an important factor, we show it in this plot.
By default, statistical software like R returns many significant digits. This does not mean we should report them. Cutting and pasting directly from R is a bad idea since you might end up showing a table, such as the one below, comparing the heights of basketball players:
heights <- cbind(rnorm(8,73,3),rnorm(8,73,3),rnorm(8,80,3),
rnorm(8,78,3),rnorm(8,78,3))
colnames(heights)<-c("SG","PG","C","PF","SF")
rownames(heights)<- paste("team",1:8)
heights
## SG PG C PF SF
## team 1 76.39843 76.21026 81.68291 75.32815 77.18792
## team 2 74.14399 71.10380 80.29749 81.58405 73.01144
## team 3 71.51120 69.02173 85.80092 80.08623 72.80317
## team 4 78.71579 72.80641 81.33673 76.30461 82.93404
## team 5 73.42427 73.27942 79.20283 79.71137 80.30497
## team 6 72.93721 71.81364 77.35770 81.69410 80.39703
## team 7 68.37715 73.01345 79.10755 71.24982 77.19851
## team 8 73.77538 75.59278 82.99395 75.57702 87.68162
We are reporting precision up to 0.00001 inches. Do you know of a tape measure with that much precision? This can be easily remedied:
round(heights,1)
## SG PG C PF SF
## team 1 76.4 76.2 81.7 75.3 77.2
## team 2 74.1 71.1 80.3 81.6 73.0
## team 3 71.5 69.0 85.8 80.1 72.8
## team 4 78.7 72.8 81.3 76.3 82.9
## team 5 73.4 73.3 79.2 79.7 80.3
## team 6 72.9 71.8 77.4 81.7 80.4
## team 7 68.4 73.0 79.1 71.2 77.2
## team 8 73.8 75.6 83.0 75.6 87.7
In general, you should follow these principles:
Some further reading:
The use of correlation to summarize reproducibility has become widespread in, for example, genomics. Despite its English language definition, mathematically, correlation is not necessarily informative with regards to reproducibility. Here we briefly describe three major problems.
The most egregious related mistake is to compute correlations of data that are not approximated by bivariate normal data. As described above, averages, standard deviations and correlations are popular summary statistics for two-dimensional data because, for the bivariate normal distribution, these five parameters fully describe the distribution. However, there are many examples of data that are not well approximated by bivariate normal data. Raw gene expression data, for example, tends to have a distribution with a very fat right tail.
The standard way to quantify reproducibility between two sets of replicated measurements, say \(x_1,\dots,x_n\) and \(y_1,\dots,y_n\), is simply to compute the distance between them:
\[ \sqrt{ \sum_{i=1}^n d_i^2} \mbox{ with } d_i=x_i - y_i \]
This metric decreases as reproducibility improves and it is 0 when the reproducibility is perfect. Another advantage of this metric is that if we divide the sum of squares by \(n\) before taking the square root, we can interpret the resulting quantity as the standard deviation of the \(d_1,\dots,d_n\) if we assume the \(d\) average out to 0. If the \(d\) can be considered residuals, then this quantity is equivalent to the root mean squared error (RMSE), a summary statistic that has been around for over a century. Furthermore, this quantity has the same units as our measurements, resulting in a more interpretable metric.
Another limitation of the correlation is that it does not detect cases that are not reproducible due to average changes. The distance metric does detect these differences. We can rewrite:
\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^n [(x_i - \mu_x) - (y_i - \mu_y) + (\mu_x - \mu_y)]^2\]
with \(\mu_x\) and \(\mu_y\) the average of each list. Then we have:
\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^n (x_i-\mu_x)^2 + \frac{1}{n} \sum_{i=1}^n (y_i - \mu_y)^2 + (\mu_x-\mu_y)^2 - \frac{2}{n}\sum_{i=1}^n (x_i-\mu_x)(y_i - \mu_y) \]
For simplicity, if we assume that the variance of both lists is 1, then this reduces to:
\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = 2 + (\mu_x-\mu_y)^2 - 2\rho\]
with \(\rho\) the correlation. So we see the direct relationship between distance and correlation. However, an important difference is that the distance contains the term \[(\mu_x-\mu_y)^2\] and, therefore, it can detect cases that are not reproducible due to large average changes.
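This relationship is easy to check numerically; a sketch where both lists are standardized using the population (divide-by-\(n\)) variance, so the identity holds exactly:

```r
set.seed(1)
n <- 1000
a <- rnorm(n)
b <- a + rnorm(n)
# standardize using the population (1/n) variance
std <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))
x <- std(a)       # mean 0, population variance 1
y <- std(b) + 2   # population variance 1, mean shifted by 2
rho <- mean((x - mean(x)) * (y - mean(y)))  # equals the correlation here
lhs <- mean((x - y)^2)
rhs <- 2 + (mean(x) - mean(y))^2 - 2 * rho
all.equal(lhs, rhs)
```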
Yet another reason correlation is not an optimal metric for reproducibility is the lack of units. To see this, we use a formula that relates the correlation of a variable with that same variable plus what is interpreted here as deviation: \(x\) and \(y=x+d\). The larger the variance of \(d\), the less \(x+d\) reproduces \(x\). Here the distance metric would depend only on the variance of \(d\) and would summarize reproducibility. However, correlation depends on the variance of \(x\) as well. If \(d\) is independent of \(x\), then
\[ \mbox{cor}(x,y) = \frac{1}{\sqrt{1+\mbox{var}(d)/\mbox{var}(x)}} \]
This suggests that correlations near 1 do not necessarily imply reproducibility. Specifically, irrespective of the variance of \(d\), we can make the correlation arbitrarily close to 1 by increasing the variance of \(x\).
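A quick simulation illustrates this attenuation effect: holding the deviation \(d\) fixed and inflating the variance of \(x\) pushes the correlation toward 1 even though reproducibility has not improved.

```r
# cor(x, x + d) can be pushed toward 1 by inflating var(x), with d unchanged
set.seed(1)
d <- rnorm(1000)  # fixed deviation, variance roughly 1
res <- sapply(c(1, 10, 100), function(sdx) {
  x <- rnorm(1000, sd = sdx)
  c(empirical = cor(x, x + d),
    theory = 1 / sqrt(1 + var(d) / var(x)))
})
colnames(res) <- paste0("sd=", c(1, 10, 100))
res
```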
Visualizing data is one of the most important steps, if not the most important, in the analysis of high-throughput data. The right visualization method may reveal problems with the experimental data that can render the results from a standard analysis, although typically appropriate, completely useless.
We have shown methods for visualizing global properties of the columns or rows, but plots that reveal relationships between columns or between rows are more complicated due to the high dimensionality of data. For example, to compare each of the 189 samples to each other, we would have to create, for example, 17,766 MA-plots. Creating one single scatterplot of the data is impossible since points are very high dimensional.
We will describe powerful techniques for exploratory data analysis based on dimension reduction. The general idea is to reduce the dataset to have fewer dimensions, yet approximately preserve important properties, such as the distance between samples. If we are able to reduce down to, say, two dimensions, we can then easily make plots. The technique behind it all, the singular value decomposition (SVD), is also useful in other contexts. Before introducing the rather complicated mathematics behind the SVD, we will motivate the ideas behind it with a simple example.
We consider an example with twin heights. Here we simulate 100 two dimensional points that represent the number of standard deviations each individual is from the mean height. Each point is a pair of twins:
Simulated twin pair heights.
To help with the illustration, think of this as high-throughput gene expression data with the twin pairs representing the \(N\) samples and the two heights representing gene expression from two genes.
We are interested in the distance between any two samples. We can
compute this using dist. For example, here is the distance
between the two orange points in the figure above:
d=dist(t(y))
as.matrix(d)[1,2]
## [1] 1.140897
What if making two-dimensional plots were too complex and we were only able to make one-dimensional plots? Can we, for example, reduce the data to a one-dimensional vector that preserves the distances between points?
If we look back at the plot, and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We have seen before that we can “rotate” the plot so that the diagonal is on the x-axis by making an MA-plot instead:
z1 = (y[1,]+y[2,])/2 #the average
z2 = (y[1,]-y[2,]) #the difference
z = rbind( z1, z2) #matrix now same dimensions as y
thelim <- c(-3,3)
mypar(1,2)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",
ylab="Twin 2 (standardized height)",
xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)
plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and MA-plot (right).
Later, we will start using linear algebra to represent transformation
of the data such as this. Here we can see that to get z we
multiplied y by the matrix:
\[ A = \, \begin{pmatrix} 1/2&1/2\\ 1&-1\\ \end{pmatrix} \implies z = A y \]
Remember that we can transform back by simply multiplying by \(A^{-1}\) as follows:
\[ A^{-1} = \, \begin{pmatrix} 1&1/2\\ 1&-1/2\\ \end{pmatrix} \implies y = A^{-1} z \]
In the plot above, the distance between the two orange points remains
roughly the same, relative to the distance between other points. This is
true for all pairs of points. A simple re-scaling of the transformation
we performed above will actually make the distances exactly the same.
What we will do is multiply by a scalar so that the standard deviations
of each dimension are preserved. If you think of the columns of
y as independent random variables with standard deviation
\(\sigma\), then note that the standard
deviations of \(Z_1\) and \(Z_2\) are:
\[ \mbox{sd}[ Z_1 ] = \mbox{sd}[ (Y_1 + Y_2) / 2 ] = \frac{1}{\sqrt{2}} \sigma \mbox{ and } \mbox{sd}[ Z_2] = \mbox{sd}[ Y_1 - Y_2 ] = {\sqrt{2}} \sigma \]
This implies that if we change the transformation above to:
\[ A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]
then the SD of the columns of \(Z\) is the same as the SD of the columns of \(Y\). Also, notice that \(A^\top A=I\). We call matrices with this property orthogonal; it guarantees the SD-preserving property described above. The distances are now exactly preserved:
A <- 1/sqrt(2)*matrix(c(1,1,1,-1),2,2)
z <- A%*%y
d <- dist(t(y))
d2 <- dist(t(z))
mypar(1,1)
plot(as.numeric(d),as.numeric(d2)) #as.numeric turns distances into long vector
abline(0,1,col=2)
Distance computed from original data and after rotation is the same.
We call this particular transformation a rotation of
y.
mypar(1,2)
thelim <- c(-3,3)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",
ylab="Twin 2 (standardized height)",
xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)
plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and after rotation (right).
The reason we applied this transformation in the first place was
because we noticed that to compute the distances between points, we
followed a direction along the diagonal in the original plot, which
after the rotation falls on the horizontal, or the first dimension of
z. So this rotation actually achieves what we originally
wanted: we can preserve the distances between points with just one
dimension. Let’s remove the second dimension of z and
recompute distances:
d3 = dist(z[1,]) ##distance computed using just first dimension
mypar(1,1)
plot(as.numeric(d),as.numeric(d3))
abline(0,1)
Distance computed with just one dimension after rotation versus actual distance.
The distance computed with just the one dimension provides a very good approximation to the actual distance and a very useful dimension reduction: from 2 dimensions to 1. This first dimension of the transformed data is actually the first principal component. This idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimension reduction more generally.
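To make the connection concrete, here is a self-contained sketch (using a simulated stand-in for the twin data, since the original y is created earlier in the book) showing that the rotated first dimension matches the first principal component returned by prcomp, up to sign:

```r
# Simulate correlated pairs resembling standardized twin heights.
set.seed(1)
n <- 100
shared <- rnorm(n, sd = 0.9)              # variation shared by both twins
y <- rbind(shared + rnorm(n, sd = 0.45),
           shared + rnorm(n, sd = 0.45))  # 2 x n data matrix
z1 <- (y[1, ] + y[2, ]) / sqrt(2)         # rotated first dimension
pc <- prcomp(t(y))                        # prcomp expects samples in rows
abs(cor(z1, pc$x[, 1]))                   # close to 1
```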
If you search the web for descriptions of PCA, you will notice a difference in notation from how we describe it here. This mainly stems from the fact that it is more common to have rows represent units. Hence, in the example shown here, \(Y\) would be transposed to be an \(N \times 2\) matrix. In statistics this is also the most common way to represent the data: individuals in the rows. However, for practical reasons, in genomics it is more common to represent units in the columns. For example, genes are rows and samples are columns. For this reason, in this book we explain PCA and all the math that goes with it in a slightly different way than it is usually done. As a result, many of the explanations you find for PCA start out with the sample covariance matrix usually denoted with \(\mathbf{X}^\top\mathbf{X}\) and having cells representing covariance between two units. Yet for this to be the case, we need the rows of \(\mathbf{X}\) to represent units. So in our notation above, you would have to compute, after scaling, \(\mathbf{Y}\mathbf{Y}^\top\) instead.
Basically, if you want our explanations to match others you have to transpose the matrices we show here.
We have already mentioned principal component analysis (PCA) above and noted its relation to the SVD. Here we provide further mathematical details.
We started the motivation for dimension reduction with a simulated example and showed a rotation that is very much related to PCA.
Twin heights scatter plot.
Here we explain specifically what the principal components (PCs) are.
Let \(\mathbf{Y}\) be a \(2 \times N\) matrix representing our data. The analogy is that we measure expression from 2 genes and each column is a sample. Suppose we are given the task of finding a \(2 \times 1\) vector \(\mathbf{u}_1\) such that \(\mathbf{u}_1^\top \mathbf{u}_1 = 1\) and it maximizes \((\mathbf{u}_1^\top\mathbf{Y})^\top (\mathbf{u}_1^\top\mathbf{Y})\). This can be viewed as a projection of each sample or column of \(\mathbf{Y}\) into the subspace spanned by \(\mathbf{u}_1\). So we are looking for a transformation in which the coordinates show high variability.
Let’s try \(\mathbf{u}=(1,0)^\top\). This projection simply gives us the height of twin 1 shown in orange below. The sum of squares is shown in the title.
mypar(1,1)
plot(t(Y), xlim=thelim, ylim=thelim,
main=paste("Sum of squares :",round(crossprod(Y[1,]),1)))
abline(h=0)
apply(Y,2,function(y) segments(y[1],0,y[1],y[2],lty=2))
## NULL
points(Y[1,],rep(0,ncol(Y)),col=2,pch=16,cex=0.75)
Can we find a direction with higher variability? How about:
\(\mathbf{u} =\begin{pmatrix}1\\-1\end{pmatrix}\) ? This does not satisfy \(\mathbf{u}^\top\mathbf{u}= 1\) so let’s instead try \(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\-1/\sqrt{2}\end{pmatrix}\)
u <- matrix(c(1,-1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar(1,1)
plot(t(Y),
main=paste("Sum of squares:",round(tcrossprod(w),1)),xlim=thelim,ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,-1,col=2)
Z = u%*%w
for(i in seq(along=w))
segments(Z[1,i],Z[2,i],Y[1,i],Y[2,i],lty=2)
points(t(Z), col=2, pch=16, cex=0.5)
Data projected onto space spanned by (1 -1).
This relates to the difference between twins, which we know is small. The sum of squares confirms this.
Finally, let’s try:
\(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\1/\sqrt{2}\end{pmatrix}\)
u <- matrix(c(1,1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar()
plot(t(Y), main=paste("Sum of squares:",round(tcrossprod(w),1)),
xlim=thelim, ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,1,col=2)
points(u%*%w, col=2, pch=16, cex=1)
Z = u%*%w
for(i in seq(along=w))
segments(Z[1,i], Z[2,i], Y[1,i], Y[2,i], lty=2)
points(t(Z),col=2,pch=16,cex=0.5)
Data projected onto space spanned by first PC.
This is a re-scaled average height, which has a higher sum of squares. There is a mathematical procedure for determining which \(\mathbf{u}\) maximizes the sum of squares and the SVD provides it for us.
The unit vector \(\mathbf{u}_1\) that maximizes the sum of squares:
\[(\mathbf{u}_1^\top\mathbf{Y})^\top(\mathbf{u}_1^\top\mathbf{Y})\]
defines the first PC: \(\mathbf{u}_1^\top\mathbf{Y}\) is referred to as the first PC. The weights \(\mathbf{u}_1\) used to obtain this PC are referred to as the loadings. Using the language of rotations, \(\mathbf{u}_1\) is also referred to as the direction of the first PC, since it defines the new coordinates.
To obtain the second PC, we repeat the exercise above, but for the residuals:
\[\mathbf{r} = \mathbf{Y} - \mathbf{u}_1 \mathbf{u}_1^\top \mathbf{Y} \]
The second PC is \(\mathbf{u}_2^\top\mathbf{Y}\), where the vector \(\mathbf{u}_2\) has the following properties:
\[ \mathbf{u}_2^\top \mathbf{u}_2=1\]
\[ \mathbf{u}_2^\top \mathbf{u}_1=0\]
and maximizes \((\mathbf{u}_2^\top\mathbf{r})^\top (\mathbf{u}_2^\top\mathbf{r})\).
When \(\mathbf{Y}\) has \(m\) rows, we repeat this procedure to find the 3rd, 4th, …, \(m\)-th PCs.
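As a sketch of the general case (our addition), the SVD of the centered data matrix delivers all the PCs at once: if the centered \(\mathbf{Y} = \mathbf{UDV}^\top\), then \(\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\) stacks the PCs as rows:

```r
# All PCs at once from the SVD of a centered m x N data matrix.
set.seed(1)
m <- 5; N <- 30
Y <- matrix(rnorm(m * N), m, N)
Yc <- Y - rowMeans(Y)          # center each feature (row)
s <- svd(Yc)
pcs <- t(s$u) %*% Yc           # i-th row is the i-th PC
max(abs(pcs - s$d * t(s$v)))   # essentially zero: the PCs equal D V^T
```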
prcomp
We have shown how to obtain PCs using the SVD. However, R has a function specifically designed to find the principal components. In this case, the data is centered by default. The following call:
pc <- prcomp( t(Y) )
produces the same results as the SVD up to arbitrary sign flips:
s <- svd( Y - rowMeans(Y) )
mypar(1,2)
for(i in 1:nrow(Y) ){
plot(pc$x[,i], s$d[i]*s$v[,i])
}
Plot showing SVD and prcomp give same results.
The loadings can be found this way:
pc$rotation
## PC1 PC2
## [1,] 0.7072304 0.7069831
## [2,] 0.7069831 -0.7072304
which are equivalent (up to a sign flip) to:
s$u
## [,1] [,2]
## [1,] -0.7072304 -0.7069831
## [2,] -0.7069831 0.7072304
The equivalent of the variance explained is included in the pc$sdev component:
pc$sdev
## [1] 1.2542672 0.2141882
We take the transpose of Y because prcomp
assumes the previously discussed ordering: units/samples in rows and
features in columns.
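Because the code above depends on data defined earlier in the book, here is a self-contained version of the same comparison (our addition) on random data:

```r
# prcomp on t(Y) matches the SVD of the row-centered Y up to sign flips.
set.seed(1)
Y <- matrix(rnorm(2 * 50), 2, 50)
pc <- prcomp(t(Y))             # samples in rows, centered by default
s <- svd(Y - rowMeans(Y))
svd_pc1 <- s$d[1] * s$v[, 1]   # first PC from the SVD
# Agreement up to an overall sign flip:
min(max(abs(pc$x[, 1] - svd_pc1)), max(abs(pc$x[, 1] + svd_pc1)))
```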
Now that we have described the concept of dimension reduction and some of the applications of SVD and principal component analysis, we focus on more details related to the mathematics behind these. We start with projections. A projection is a linear algebra concept that helps us understand many of the mathematical operations we perform on high-dimensional data. For more details, you can review projections in a linear algebra book. Here we provide a quick review and then some data analysis related examples.
As a review, remember that projections minimize the distance between a point and a subspace.
We illustrate projections using a figure, in which the arrow on top is pointing to a point in space. In this particular cartoon, the space is two dimensional, but we should be thinking abstractly. The space is represented by the Cartesian plane and the line on which the little person stands is a subspace of points. The projection to this subspace is the place that is closest to the original point. Geometry tells us that we can find this closest point by dropping a perpendicular line (dotted line) from the point to the space. The little person is standing on the projection. The amount this person had to walk from the origin to the new projected point is referred to as the coordinate.
For the explanation of projections, we will use the standard matrix algebra notation for points: \(\vec{y} \in \mathbb{R}^N\) is a point in \(N\)-dimensional space and \(L \subset \mathbb{R}^N\) is a smaller subspace.
If we let \(Y = \begin{pmatrix} 2 \\ 3\end{pmatrix}\), we can plot it like this:
mypar (1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
text(2,3," Y",pos=4,cex=3)
Geometric representation of Y.
We can immediately define a coordinate system by projecting this vector to the space defined by: \(\begin{pmatrix} 1\\ 0\end{pmatrix}\) (the x-axis) and \(\begin{pmatrix} 0\\ 1\end{pmatrix}\) (the y-axis). The projections of \(Y\) to the subspace defined by these points are 2 and 3 respectively:
\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &=2 \begin{pmatrix} 1\\ 0\end{pmatrix} + 3 \begin{pmatrix} 0\\ 1\end{pmatrix} \end{align*}\]
We say that \(2\) and \(3\) are the coordinates and that \(\begin{pmatrix} 1\\ 0\end{pmatrix} \mbox{and} \begin{pmatrix} 0\\1 \end{pmatrix}\) are the bases.
Now let’s define a new subspace. The red line in the plot below is subset \(L\) defined by points satisfying \(c \vec{v}\) with \(\vec{v}=\begin{pmatrix} 2& 1\end{pmatrix}^\top\). The projection of \(\vec{y}\) onto \(L\) is the closest point on \(L\) to \(\vec{y}\). So we need to find the \(c\) that minimizes the distance between \(\vec{y}\) and \(c\vec{v}=(2c,c)\). In linear algebra, we learn that the difference between these points is orthogonal to the space so:
\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]
this implies that:
\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]
and:
\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}}\]
Here the dot \(\cdot\) represents the dot product: \(\,\, \vec{x} \cdot \vec{y} = x_1 y_1+\dots x_n y_n\).
The following R code confirms this equation works:
mypar(1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
abline(0,0.5,col="red",lwd=3) #if x=2c and y=c then slope is 0.5 (y=0.5x)
text(2,3," Y",pos=4,cex=3)
y=c(2,3)
x=c(2,1)
cc = crossprod(x,y)/crossprod(x)
segments(x[1]*cc,x[2]*cc,y[1],y[2],lty=2)
text(x[1]*cc,x[2]*cc,expression(hat(Y)),pos=4,cex=3)
Projection of Y onto new subspace.
Note that if \(\vec{v}\) was such that \(\vec{v}\cdot \vec{v}=1\), then \(\hat{c}\) is simply \(\vec{y} \cdot \vec{v}\) and the space \(L\) does not change. This simplification is one reason we like orthogonal matrices.
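A small numeric illustration of this simplification (added here): after normalizing \(\vec{v}\) to unit length, the projection coefficient reduces to the dot product \(\vec{y} \cdot \vec{v}\), and the projected point is unchanged.

```r
# Projection of y onto the line spanned by v, two equivalent ways.
y <- c(2, 3)
v <- c(2, 1)
v_unit <- v / sqrt(sum(v^2))            # now v_unit . v_unit = 1
c_hat      <- sum(y * v) / sum(v * v)   # general formula: 1.4
c_hat_unit <- sum(y * v_unit)           # simplified formula for unit v
c_hat * v                               # projected point (2.8, 1.4)
c_hat_unit * v_unit                     # same projected point
```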
Let \(\vec{y} \in \mathbb{R}^N\) and \(L \subset \mathbb{R}^N\) is the space spanned by:
\[\vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\}\]
In this space, all components of the vectors are the same number, so we can think of this space as representing the constants: in the projection each dimension will be the same value. So what \(c\) minimizes the distance between \(c\vec{v}\) and \(\vec{y}\) ?
When talking about problems like this, we sometimes use two dimensional figures such as the one above. We simply abstract and think of \(\vec{y}\) as a point in \(N\) dimensions and \(L\) as a subspace defined by a smaller number of values, in this case just one: \(c\).
Getting back to our question, we know that the projection is:
\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}}\]
which in this case is the average:
\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N 1} = \bar{Y} \]
Here, it also would have been just as easy to use calculus:
\[\frac{\partial}{\partial c}\sum_{i=1}^N (Y_i - c)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \hat{c}) = 0 \implies\]
\[ N c = \sum_{i=1}^N Y_i \implies \hat{c}=\bar{Y}\]
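We can confirm this numerically (a check added here): projecting onto the all-ones vector returns the sample mean.

```r
# The projection coefficient onto v = (1, ..., 1) is the sample mean.
set.seed(1)
y <- rnorm(10)
v <- rep(1, length(y))
c_hat <- sum(y * v) / sum(v * v)
all.equal(c_hat, mean(y))   # TRUE
```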
Let us give a slightly more complicated example. Simple linear regression can also be explained with projections. Our data \(\mathbf{Y}\) (we are no longer going to use the \(\vec{y}\) notation) is again an N-dimensional vector and our model predicts \(Y_i\) with a line \(\beta_0 + \beta_1 X_i\). We want to find the \(\beta_0\) and \(\beta_1\) that minimize the distance between \(Y\) and the space defined by:
\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \}\]
with:
\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]
Our \(N\times 2\) matrix \(\mathbf{X}\) is \([ \vec{v}_0 \,\, \vec{v}_1]\) and any point in \(L\) can be written as \(X\vec{\beta}\).
The equation for the multidimensional version of orthogonal projection is:
\[X^\top (\vec{y}-X\vec{\beta}) = 0\]
which we have seen before and gives us:
\[X^\top X \hat{\beta}= X^\top \vec{y} \]
\[\hat{\beta}= (X^\top X)^{-1}X^\top \vec{y}\]
And the projection to \(L\) is therefore:
\[X (X^\top X)^{-1}X^\top \vec{y}\]
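As a sketch (our addition, on simulated data), we can verify that this projection matrix, applied to \(\vec{y}\), reproduces the fitted values that lm computes:

```r
# Projection of y onto the column space of X equals the lm fitted values.
set.seed(1)
N <- 20
x <- rnorm(N)
X <- cbind(1, x)                      # design matrix [v0 v1]
y <- 2 + 3 * x + rnorm(N)
proj <- X %*% solve(crossprod(X)) %*% t(X) %*% y
max(abs(proj - fitted(lm(y ~ x))))    # essentially zero
```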
We have seen that in order to calculate the LSE, we need to invert a
matrix. In previous sections we used the function solve.
However, solve is not numerically stable. When coding the LSE computation, we
use the QR decomposition.
Remember that to minimize the RSS:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
We need to solve:
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
The solution is:
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
Thus, we need to compute \((\mathbf{X}^\top \mathbf{X})^{-1}\).
solve is numerically unstable
To demonstrate what we mean by numerically unstable, we construct an extreme case:
n <- 50;M <- 500
x <- seq(1,M,len=n)
X <- cbind(1,x,x^2,x^3)
colnames(X) <- c("Intercept","x","x2","x3")
beta <- matrix(c(1,1,1,1),4,1)
set.seed(1)
y <- X%*%beta+rnorm(n,sd=1)
The standard R function for inverse gives an error:
solve(crossprod(X)) %*% crossprod(X,y)
To see why this happens, look at \((\mathbf{X}^\top \mathbf{X})\):
options(digits=4)
log10(crossprod(X))
## Intercept x x2 x3
## Intercept 1.699 4.098 6.625 9.203
## x 4.098 6.625 9.203 11.810
## x2 6.625 9.203 11.810 14.434
## x3 9.203 11.810 14.434 17.070
Note the difference of several orders of magnitude. On a digital computer, we have a limited range of numbers, so when very small and very large numbers must be considered together, some numbers are effectively treated as 0. This in turn leads to operations that are practically divisions by 0, which produce errors.
The QR factorization is based on a mathematical result that tells us that we can decompose any full rank \(N\times p\) matrix \(\mathbf{X}\) as:
\[ \mathbf{X = QR} \]
with:
\(\mathbf{Q}\) an \(N\times p\) matrix satisfying \(\mathbf{Q}^\top\mathbf{Q}=\mathbf{I}\), and \(\mathbf{R}\) a \(p\times p\) upper triangular matrix.
Upper triangular matrices are very convenient for solving systems of equations.
In the example below, the matrix on the left is upper triangular: it only has 0s below the diagonal. This facilitates solving the system of equations greatly:
\[ \, \begin{pmatrix} 1&2&-1\\ 0&1&2\\ 0&0&1\\ \end{pmatrix} \begin{pmatrix} a\\ b\\ c\\ \end{pmatrix} = \begin{pmatrix} 6\\ 4\\ 1\\ \end{pmatrix} \]
We immediately know that \(c=1\), which implies that \(b+2=4\). This in turn implies \(b=2\) and thus \(a+4-1=6\) so \(a = 3\). Writing an algorithm to do this is straightforward for any upper triangular matrix.
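In R, this back-substitution is exactly what the backsolve function performs (example added here):

```r
# Solve the upper triangular system from above by back-substitution.
R <- matrix(c(1, 0, 0,
              2, 1, 0,
             -1, 2, 1), 3, 3)   # column-major fill: rows (1,2,-1),(0,1,2),(0,0,1)
backsolve(R, c(6, 4, 1))        # a = 3, b = 2, c = 1
```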
If we rewrite the equations of the LSE using \(\mathbf{QR}\) instead of \(\mathbf{X}\) we have:
\[\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{Y}\]
\[(\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R}) \boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top \mathbf{Y}\]
\[\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[(\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = (\mathbf{R}^\top)^{-1} \mathbf{R}^\top \mathbf{Q}^\top \mathbf{Y}\]
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
\(\mathbf{R}\) being upper triangular makes solving this more stable. Also, because \(\mathbf{Q}^\top\mathbf{Q}=\mathbf{I}\) , we know that the columns of \(\mathbf{Q}\) are in the same scale which stabilizes the right side.
Now we are ready to find LSE using the QR decomposition. To solve:
\[\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{Y}\]
We use backsolve which takes advantage of the upper
triangular nature of \(\mathbf{R}\).
QR <- qr(X)
Q <- qr.Q( QR )
R <- qr.R( QR )
(betahat <- backsolve(R, crossprod(Q,y) ) )
## [,1]
## [1,] 0.9038
## [2,] 1.0066
## [3,] 1.0000
## [4,] 1.0000
In practice, we do not need to do any of this due to the built-in
solve.qr function:
QR <- qr(X)
(betahat <- solve.qr(QR, y))
## [,1]
## Intercept 0.9038
## x 1.0066
## x2 1.0000
## x3 1.0000
This factorization also simplifies the calculation for fitted values:
\[\mathbf{X}\boldsymbol{\hat{\beta}} = (\mathbf{QR})\mathbf{R}^{-1}\mathbf{Q}^\top \mathbf{y}= \mathbf{Q}\mathbf{Q}^\top\mathbf{y} \]
In R, we simply do the following:
library(rafalib)
mypar(1,1)
plot(x,y)
fitted <- tcrossprod(Q)%*%y
lines(x,fitted,col=2)
To obtain the standard errors of the LSE, we note that:
\[(\mathbf{X^\top X})^{-1} = (\mathbf{R^\top Q^\top QR})^{-1} = (\mathbf{R^\top R})^{-1}\]
The function chol2inv is specifically designed to find
this inverse. So all we do is the following:
df <- length(y) - QR$rank
sigma2 <- sum((y-fitted)^2)/df
varbeta <- sigma2*chol2inv(qr.R(QR))
SE <- sqrt(diag(varbeta))
cbind(betahat,SE)
## SE
## Intercept 0.9038 4.508e-01
## x 1.0066 7.858e-03
## x2 1.0000 3.662e-05
## x3 1.0000 4.802e-08
This gives us identical results to the lm function.
summary(lm(y~0+X))$coef
## Estimate Std. Error t value Pr(>|t|)
## XIntercept 0.9038 4.508e-01 2.005e+00 5.089e-02
## Xx 1.0066 7.858e-03 1.281e+02 2.171e-60
## Xx2 1.0000 3.662e-05 2.731e+04 1.745e-167
## Xx3 1.0000 4.802e-08 2.082e+07 4.559e-300
In a previous section, we motivated the use of matrix algebra with this system of equations:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
We described how this system can be rewritten and solved using matrix algebra:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
Having described matrix notation, we will explain the operations we perform with matrices. For example, above we have matrix multiplication and we also have a symbol representing the inverse of a matrix. The importance of these operations and others will become clear once we present specific examples related to data analysis.
We start with one of the simplest operations: scalar multiplication. If \(a\) is scalar and \(\mathbf{X}\) is a matrix, then:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies a \mathbf{X} = \begin{pmatrix} a x_{1,1} & \dots & a x_{1,p}\\ a x_{2,1}&\dots & a x_{2,p} \\ & \vdots & \\ a x_{N,1} & \dots & a x_{N,p} \end{pmatrix} \]
R automatically follows this rule when we multiply a number by a
matrix using *:
X <- matrix(1:12,4,3)
print(X)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
a <- 2
print(a*X)
## [,1] [,2] [,3]
## [1,] 2 10 18
## [2,] 4 12 20
## [3,] 6 14 22
## [4,] 8 16 24
The transpose is an operation that simply changes columns to rows. We use a \(\top\) to denote a transpose. The technical definition is as follows: if X is as we defined it above, here is the transpose which will be \(p\times N\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \implies \mathbf{X}^\top = \begin{pmatrix} x_{1,1}&\dots & x_{p,1} \\ x_{1,2}&\dots & x_{p,2} \\ & \vdots & \\ x_{1,N}&\dots & x_{p,N} \end{pmatrix} \]
In R we simply use t:
X <- matrix(1:12,4,3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
We start by describing the matrix multiplication shown in the original system of equations example:
\[ \begin{align*} a + b + c &=6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
What we are doing is multiplying the rows of the first matrix by the columns of the second. Since the second matrix only has one column, we perform this multiplication by doing the following:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix}= \begin{pmatrix} a + b + c \\ 3a - 2b + c \\ 2a + b - c \end{pmatrix} \]
Here is a simple example. We can check to see if
abc=c(3,2,1) is a solution:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
abc <- c(3,2,1) #use as an example
rbind( sum(X[1,]*abc), sum(X[2,]*abc), sum(X[3,]*abc))
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can use the %*% to perform the matrix multiplication
and make this much more compact:
X%*%abc
## [,1]
## [1,] 6
## [2,] 6
## [3,] 7
We can see that c(3,2,1) is not a solution as the answer
here is not the required c(6,2,1).
To get the solution, we will need to invert the matrix on the left, a concept we learn about below.
Here is the general definition of matrix multiplication of matrices \(A\) and \(X\):
\[ \mathbf{AX} = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N} \end{pmatrix} \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} x_{i,1} & \dots & \sum_{i=1}^N a_{1,i} x_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} x_{i,1} & \dots & \sum_{i=1}^N a_{M,i} x_{i,p} \end{pmatrix} \]
You can only take the product if the number of columns of the first matrix \(A\) equals the number of rows of the second one \(X\). Also, the final matrix has the same row numbers as the first \(A\) and the same column numbers as the second \(X\). After you study the example below, you may want to come back and re-read the sections above.
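A short example of the dimension rule (added here): multiplying a \(2\times 3\) matrix by a \(3\times 2\) matrix gives a \(2\times 2\) result, while non-conformable dimensions produce an error.

```r
A <- matrix(1:6, 2, 3)   # 2 x 3
X <- matrix(1:6, 3, 2)   # 3 x 2
dim(A %*% X)             # 2 2: rows of A, columns of X
res <- try(A %*% A, silent = TRUE)  # 2x3 times 2x3 is not defined
inherits(res, "try-error")          # TRUE
```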
The identity matrix is analogous to the number 1: if you multiply the identity matrix by another matrix, you get the same matrix. For this to happen, we need it to be like this:
\[ \mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots&\ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} \]
By this definition, the identity always has to have the same number of rows as columns or be what we call a square matrix.
If you follow the matrix multiplication rule above, you notice this works out:
\[ \mathbf{XI} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots& &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix} = \begin{pmatrix} x_{1,1} & \dots & x_{1,p}\\ & \vdots & \\ x_{N,1} & \dots & x_{N,p} \end{pmatrix} \]
In R you can form an identity matrix this way:
n <- 5 #pick dimensions
diag(n)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
The inverse of matrix \(X\), denoted with \(X^{-1}\), has the property that, when multiplied, gives you the identity \(X^{-1}X=I\). Of course, not all matrices have inverses. For example, a \(2\times 2\) matrix with 1s in all its entries does not have an inverse.
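We can see this in R (example added here): solve fails on the all-1s matrix because its columns are linearly dependent.

```r
# A singular matrix has no inverse, so solve raises an error.
ones <- matrix(1, 2, 2)
res <- try(solve(ones), silent = TRUE)
inherits(res, "try-error")   # TRUE: the matrix is singular
```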
As we will see when we get to the section on applications to linear
models, being able to compute the inverse of a matrix is quite useful. A
very convenient aspect of R is that it includes a predefined function
solve to do this. Here is how we would use it to solve the
system of equations:
X <- matrix(c(1,3,2,1,-2,1,1,1,-1),3,3)
y <- matrix(c(6,2,1),3,1)
solve(X)%*%y #equivalent to solve(X,y)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
Please note that solve is a function that should be used
with caution as it is not generally numerically stable. We explain this
in much more detail in the QR factorization section.
Now we are ready to see how matrix algebra can be useful when analyzing data. We start with some simple examples and eventually arrive at the main one: how to write linear models with matrix algebra notation and solve the least squares problem.
To compute the sample average and variance of our data, we use these formulas \(\bar{Y}=\frac{1}{N}\sum_{i=1}^N Y_i\) and \(\mbox{var}(Y)=\frac{1}{N} \sum_{i=1}^N (Y_i - \bar{Y})^2\). We can represent these with matrix multiplication. First, define this \(N \times 1\) matrix made just of 1s:
\[ A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix} \]
This implies that:
\[ \frac{1}{N} \mathbf{A}^\top Y = \frac{1}{N} \begin{pmatrix}1&1&\dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix}= \frac{1}{N} \sum_{i=1}^N Y_i = \bar{Y} \]
Note that we are multiplying by the scalar \(1/N\). In R, we multiply matrices using
%*%:
data(father.son,package="UsingR")
y <- father.son$sheight
print(mean(y))
## [1] 68.68
N <- length(y)
Y<- matrix(y,N,1)
A <- matrix(1,N,1)
barY=t(A)%*%Y / N
print(barY)
## [,1]
## [1,] 68.68
As we will see later, multiplying the transpose of a matrix with another is very common in statistics. In fact, it is so common that there is a function in R:
barY=crossprod(A,Y) / N
print(barY)
## [,1]
## [1,] 68.68
For the variance, we note that if:
\[ \mathbf{r}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_N - \bar{Y} \end{pmatrix}, \,\, \frac{1}{N} \mathbf{r}^\top\mathbf{r} = \frac{1}{N}\sum_{i=1}^N (Y_i - \bar{Y})^2 \]
In R, if you only send one matrix into crossprod, it
computes: \(r^\top r\) so we can simply
type:
r <- y - barY
## Warning in y - barY: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
crossprod(r)/N
## [,1]
## [1,] 7.915
Which is almost equivalent to:
library(rafalib)
popvar(y)
## [1] 7.915
Now we are ready to put all this to use. Let’s start with Galton’s example. If we define these matrices:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} , \mathbf{X} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} , \mathbf{\beta} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \mathbf{\varepsilon} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
Then we can write the model:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
as:
\[ \, \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_N \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
or simply:
\[ \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \]
which is a much simpler way to write it.
The least squares equation becomes simpler as well since it is the following cross-product:
\[ (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\top (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) \]
So now we are ready to determine which values of \(\beta\) minimize the above, which we can do using calculus to find the minimum.
There are a series of rules that permit us to compute partial derivative equations in matrix notation. By equating the derivative to 0 and solving for the \(\beta\), we will have our solution. The only one we need here tells us that the derivative of the above equation is:
\[ 2 \mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0 \]
\[ \mathbf{X}^\top \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^\top \mathbf{Y} \]
\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
and we have our solution. We usually put a hat on the \(\beta\) that solves this, \(\hat{\beta}\) , as it is an estimate of the “real” \(\beta\) that generated the data.
Remember that least squares is like a square (we multiply something by itself), so this formula is similar to the derivative of \(f(x)^2\) being \(2f(x)f'(x)\).
Let’s see how it works in R:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
X <- cbind(1,x)
betahat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
betahat <- solve( crossprod(X) ) %*% crossprod( X, y )
Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) for any value of \(x\):
newx <- seq(min(x),max(x),len=100)
X <- cbind(1,newx)
fitted <- X%*%betahat
plot(x,y,xlab="Father's height",ylab="Son's height")
lines(newx,fitted,col=2)
Galton’s data with fitted regression line.
This \(\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\) is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations. For example, in our falling object problem:
set.seed(1)
g <- 9.8 #meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) #time in secs, t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1)
Notice that we are using almost the same exact code:
X <- cbind(1,tt,tt^2)
y <- d
betahat <- solve(crossprod(X))%*%crossprod(X,y)
newtt <- seq(min(tt),max(tt),len=100)
X <- cbind(1,newtt,newtt^2)
fitted <- X%*%betahat
plot(tt,y,xlab="Time",ylab="Height")
lines(newtt,fitted,col=2)
Fitted parabola to simulated data for distance travelled versus time of falling object measured with error.
And the resulting estimates are what we expect:
betahat
## [,1]
## 56.5317
## tt 0.5014
## -5.0386
The Tower of Pisa is about 56 meters high. Since we are just dropping the object there is no initial velocity, and half the constant of gravity is 9.8/2=4.9 meters per second squared.
The lm Function
R has a very convenient function that fits these models. We will learn more about this function later, but here is a preview:
X <- cbind(tt,tt^2)
fit=lm(y~X)
summary(fit)
##
## Call:
## lm(formula = y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.530 -0.488 0.254 0.656 1.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.532 0.545 103.70 <2e-16 ***
## Xtt 0.501 0.743 0.68 0.51
## X -5.039 0.211 -23.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.982 on 22 degrees of freedom
## Multiple R-squared: 0.997, Adjusted R-squared: 0.997
## F-statistic: 4.02e+03 on 2 and 22 DF, p-value: <2e-16
Note that we obtain the same values as above.
We have shown how to write linear models using linear algebra. We are going to do this for several examples, many of which are related to designed experiments. We also demonstrated how to obtain least squares estimates. Nevertheless, it is important to remember that because \(y\) is a random variable, these estimates are random as well. In a later section, we will learn how to compute standard error for these estimates and use this to perform inference.
Here we introduce the basics of matrix notation. Initially this may seem over-complicated, but once we discuss examples, you will appreciate the power of using this notation to both explain and derive solutions, as well as implement them as R code.
Linear algebra notation actually simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R. We will discuss the basics of this notation and then show some examples in R.
The main point of this entire exercise is to show how we can write the models above using matrix notation, and then explain how this is useful for solving the least squares equation. We start by simply defining notation and matrix multiplication, but bear with us since we eventually get back to the practical application.
Linear algebra was created by mathematicians to solve systems of linear equations such as this:
\[ \begin{align*} a + b + c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 1 \end{align*} \]
It provides very useful machinery to solve these problems generally. We will learn how we can write and solve this system using matrix algebra notation:
\[ \, \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix} \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \implies \begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&1\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 1 \end{pmatrix} \]
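As a preview of where this is headed, the system above can be solved directly in base R with solve, which computes the matrix inverse applied to the right-hand side. This snippet is a sketch illustrating the notation, not part of the derivation itself:

```r
## Coefficient matrix and right-hand side of the 3-equation system above
A <- matrix(c(1,  1,  1,
              3, -2,  1,
              2,  1, -1), nrow = 3, byrow = TRUE)
b <- c(6, 2, 1)
solve(A, b)  ## computes A^{-1} b, giving a = 1, b = 2, c = 3
```

Note that solve(A, b) is preferred over solve(A) %*% b because it avoids explicitly forming the inverse.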
This section explains the notation used above. It turns out that we can borrow this notation for linear models in statistics as well.
In the falling object, father-son heights, and mouse weight examples, the random variables associated with the data were represented by \(Y_1,\dots,Y_n\). We can think of this as a vector. In fact, in R we are already doing this:
data(father.son,package="UsingR")
y=father.son$fheight
head(y)
## [1] 65.05 63.25 64.96 65.75 61.14 63.02
In math we can also use just one symbol. We usually use bold to distinguish it from the individual entries:
\[ \mathbf{Y} = \begin{pmatrix} Y_1\\\ Y_2\\\ \vdots\\\ Y_N \end{pmatrix} \]
For reasons that will soon become clear, the default representation of data vectors has dimension \(N\times 1\), as opposed to \(1 \times N\).
We don’t always use bold because one can usually tell from the context what is a matrix.
Similarly, we can use math notation to represent the covariates or predictors. In a case with two predictors we can represent them like this:
\[ \mathbf{X}_1 = \begin{pmatrix} x_{1,1}\\ \vdots\\ x_{N,1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1,2}\\ \vdots\\ x_{N,2} \end{pmatrix} \]
Note that for the falling object example \(x_{i,1}= t_i\) and \(x_{i,2}=t_i^2\), with \(t_i\) the time of the i-th observation. Also, keep in mind that vectors can be thought of as \(N\times 1\) matrices.
For reasons that will soon become apparent, it is convenient to represent these in matrices:
\[ \mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1,1}&x_{1,2}\\ \vdots&\vdots\\ x_{N,1}&x_{N,2} \end{pmatrix} \]
This matrix has dimension \(N \times 2\). We can create this matrix in R this way:
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <- cbind(X1=tt,X2=tt^2)
head(X)
## X1 X2
## [1,] 0.0000 0.00000
## [2,] 0.1417 0.02007
## [3,] 0.2833 0.08028
## [4,] 0.4250 0.18062
## [5,] 0.5667 0.32111
## [6,] 0.7083 0.50174
dim(X)
## [1] 25 2
We can also use this notation to denote an arbitrary number of covariates with the following \(N\times p\) matrix:
\[ \mathbf{X} = \begin{pmatrix} x_{1,1}&\dots & x_{1,p} \\ x_{2,1}&\dots & x_{2,p} \\ & \vdots & \\ x_{N,1}&\dots & x_{N,p} \end{pmatrix} \]
Just as an example, we show you how to make one in R now using
matrix instead of cbind:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 101 201 301 401
## [2,] 2 102 202 302 402
## [3,] 3 103 203 303 403
## [4,] 4 104 204 304 404
## [5,] 5 105 205 305 405
## [6,] 6 106 206 306 406
dim(X)
## [1] 100 5
By default, the matrices are filled column by column. The
byrow=TRUE argument lets us change that to row by row:
N <- 100; p <- 5
X <- matrix(1:(N*p),N,p,byrow=TRUE)
head(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [6,] 26 27 28 29 30
Finally, we define a scalar. A scalar is just a number, which we call a scalar because we want to distinguish it from vectors and matrices. We usually use lower case and don’t bold. In the next section, we will understand why we make this distinction.
This book focuses on teaching statistical concepts and data analysis programming skills. We avoid mathematical notation as much as possible, but we do use it. We do not want readers to be intimidated by the notation though. Mathematics is actually the easier part of learning statistics. Unfortunately, many text books use mathematical notation in what we believe to be an over-complicated way. For this reason, we do try to keep the notation as simple as possible. However, we do not want to water down the material, and some mathematical notation facilitates a deeper understanding of the concepts. Here we describe a few specific symbols that we use often. If they appear intimidating to you, please take some time to read this section carefully as they are actually simpler than they seem. Because by now you should be somewhat familiar with R, we will make the connection between mathematical notation and R code.
Those of us dealing with data almost always have a series of numbers. To describe the concepts in an abstract way, we use indexing. For example, 5 numbers:
x <- 1:5
can be generally represented like this \(x_1, x_2, x_3, x_4, x_5\). We use dots to simplify this \(x_1,\dots,x_5\) and indexing to simplify even more \(x_i, i=1,\dots,5\). If we want to describe a procedure for a list of any size \(n\), we write \(x_i, i=1,\dots,n\).
We sometimes have two indexes. For example, we may have several measurements (blood pressure, weight, height, age, cholesterol level) for 100 individuals. We can then use double indexes: \(x_{i,j}, i=1,\dots,100, j=1,\dots,5\).
A very common operation in data analysis is to sum several numbers. This comes up, for example, when we compute averages and standard deviations. If we have many numbers, there is a mathematical notation that makes it quite easy to express the following:
n <- 1000
x <- 1:n
S <- sum(x)
and it is the \(\sum\) notation (capital S in Greek):
\[ S = \sum_{i=1}^n x_i \]
Note that we make use of indexing as well. We will see that what is included inside the summation can become quite complicated. However, the summation part should not confuse you as it is a simple operation.
We would prefer to avoid Greek letters, but they are ubiquitous in the statistical literature so we want you to become used to them. They are mainly used to distinguish the unknown from the observed. Suppose we want to find out the average height of a population and we take a sample of 1,000 people to estimate this. The unknown average we want to estimate is often denoted with \(\mu\), the Greek letter for m (m is for mean). The standard deviation is often denoted with \(\sigma\), the Greek letter for s. Measurement error or other unexplained random variability is typically denoted with \(\varepsilon\), the Greek letter for e. Effect sizes, for example the effect of a diet on weight, are typically denoted with \(\beta\). We may use other Greek letters but those are the most commonly used.
You should get used to these four Greek letters as you will be seeing them often: \(\mu\), \(\sigma\), \(\beta\) and \(\varepsilon\).
Note that indexing is sometimes used in conjunction with Greek letters to denote different groups. For example, if we have one set of numbers denoted with \(x\) and another with \(y\) we may use \(\mu_x\) and \(\mu_y\) to denote their averages.
In the text we often talk about asymptotic results. Typically, this refers to an approximation that gets better and better as the number of data points we consider gets larger and larger, with perfect approximations occurring when the number of data points is \(\infty\). In practice, there is no such thing as \(\infty\), but it is a convenient concept to understand. One way to think about asymptotic results is as results that become better and better as some number increases and we can pick a number so that a computer can’t tell the difference between the approximation and the real number. Here is a very simple example that approximates 1/3 with decimals:
onethird <- function(n) sum( 3/10^c(1:n))
1/3 - onethird(4)
## [1] 3.333e-05
1/3 - onethird(10)
## [1] 3.333e-11
1/3 - onethird(16)
## [1] 0
In the example above, 16 is practically \(\infty\).
We only use these a couple of times so you can skip this section if you prefer. However, integrals are actually much simpler to understand than perhaps you realize.
For certain statistical operations, we need to figure out areas under the curve. For example, for a function \(f(x)\) …
Integral of a function.
…we need to know what proportion of the total area under the curve is grey.
The grey area can be thought of as many small grey bars stacked next to each other. The area is then just the sum of the areas of these little bars. The problem is that we can’t do this for every number between 2 and 4 because there are an infinite number. The integral is the mathematical solution to this problem. In this case, the total area is 1 so the answer to what proportion is grey is the following integral:
\[ \int_2^4 f(x) \, dx \]
Because we constructed this example, we know that the grey area is 2.27% of the total. Note that this is very well approximated by an actual sum of little bars:
f <- dnorm ##f is not defined in this excerpt; we assume the standard normal density, consistent with the 2.27% figure
width <- 0.01
x <- seq(2,4,width)
areaofbars <- f(x)*width
sum( areaofbars )
## [1] 0.02299
The smaller we make width, the closer the sum gets to
the integral, which is equal to the area.
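R can also compute the integral directly with the integrate function. This sketch assumes, as above, that \(f\) is the standard normal density (consistent with the 2.27% figure); the same area is available in closed form from the CDF pnorm:

```r
## Assuming f is the standard normal density
f <- dnorm
integrate(f, 2, 4)$value  ## numerical integral, about 0.0227
pnorm(4) - pnorm(2)       ## same area via the normal CDF
```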
Suppose we were given high-throughput gene expression data that was measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. If instead of thousands of genes, we were handed data from just one gene, we could simply apply the inference techniques that we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.
An important concept to remember in order to understand the concepts presented in this chapter is that p-values are random variables. To see this, consider the example in which we define a p-value from a t-test with a large enough sample size to use the CLT approximation. Then our p-value is defined as the probability that a normally distributed random variable is larger, in absolute value, than the observed t-test, call it \(Z\). So for a two sided test the p-value is:
\[ p = 2 \{ 1 - \Phi(\mid Z \mid)\} \]
In R, we write:
2*( 1-pnorm( abs(Z) ) )
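The line above assumes an observed statistic Z is already defined. As a minimal runnable sketch, with a hypothetical value of the t-statistic:

```r
## Z is a hypothetical observed t-statistic, used only for illustration
Z <- 2.1
2 * (1 - pnorm(abs(Z)))  ## two-sided p-value, about 0.036
```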
Now because \(Z\) is a random
variable and \(\Phi\) is a
deterministic function, \(p\) is also a
random variable. We will create a Monte Carlo simulation showing how the
values of \(p\) change. We use
femaleControlsPopulation.csv from earlier chapters.
We read in the data, and use replicate to repeatedly
create p-values.
set.seed(1)
filename <- "femaleControlsPopulation.csv" ##as in earlier chapters; assumes the file is in the working directory
population = unlist( read.csv(filename) )
N <- 12
B <- 10000
pvals <- replicate(B,{
control = sample(population,N)
treatment = sample(population,N)
t.test(treatment,control)$p.val
})
hist(pvals)
P-value histogram for 10,000 tests in which null hypothesis is true.
As implied by the histogram, in this case the distribution of the p-value is uniformly distributed. In fact, we can show theoretically that when the null hypothesis is true, this is always the case. For the case in which we use the CLT, we have that the null hypothesis \(H_0\) implies that our test statistic \(Z\) follows a normal distribution with mean 0 and SD 1 thus:
\[ p_a = \mbox{Pr}(Z < a \mid H_0) = \Phi(a) \]
This implies that:
\[ \begin{align*} \mbox{Pr}(p < p_a) &= \mbox{Pr}[ \Phi^{-1}(p) < \Phi^{-1}(p_a) ] \\ & = \mbox{Pr}(Z < a) = p_a \end{align*} \]
which is the definition of a uniform distribution.
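We can check this result with a self-contained Monte Carlo simulation. Here we use simulated normal data in place of the CSV population file (an assumption made so the sketch runs on its own); under the null, about 5% of the p-values should fall below 0.05:

```r
## Sketch: null t-tests on simulated normal data (stand-in for the CSV population)
set.seed(1)
B <- 10000
pvals <- replicate(B, {
  control   <- rnorm(12)
  treatment <- rnorm(12)  ## null is true: both samples from the same distribution
  t.test(treatment, control)$p.value
})
mean(pvals < 0.05)  ## close to 0.05, as expected for uniform p-values
```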
In this data we have two groups denoted with 0 and 1:
If we were interested in a particular gene, let’s arbitrarily pick the one on the 25th row, we would simply compute a t-test. To compute a p-value, we will use the t-distribution approximation and therefore we need the population data to be approximately normal. We check this assumption with a qq-plot:
The qq-plots show that the data is well approximated by the normal approximation. The t-test does not find this gene to be statistically significant:
To answer the question for each gene, we simply repeat the above for
each gene. Here we will define our own function and use
apply:
We can now see which genes have p-values less than, say, 0.05. For example, right away we see that…
… genes had p-values less than 0.05.
However, as we will describe in more detail below, we have to be careful in interpreting this result because we have performed over 8,000 tests. If we performed the same procedure on random data, for which the null hypothesis is true for all features, we obtain the following results:
As we will explain later in the chapter, this is to be expected: 419 is roughly 0.05*8192 and we will describe the theory that tells us why this prediction works.
Before we continue, we should point out that the above implementation is very inefficient. There are several faster implementations that perform t-tests for high-throughput data. We make use of a function that is not available from CRAN, but rather from the Bioconductor project.
To download and install packages from Bioconductor, we can use the
install_bioc function in rafalib to install
the package:
install_bioc("genefilter")
In this book we try to minimize mathematical notation as much as possible. Furthermore, we avoid using calculus to motivate statistical concepts. However, Matrix Algebra (also referred to as Linear Algebra) and its mathematical notation greatly facilitates the exposition of the advanced data analysis techniques covered in the remainder of this book. We therefore dedicate a chapter of this book to introducing Matrix Algebra. We do this in the context of data analysis and using one of the main applications: Linear Models.
We will describe three examples from the life sciences: one from physics, one related to genetics, and one from a mouse experiment. They are very different, yet we end up using the same statistical technique: fitting linear models. Linear models are typically taught and described in the language of matrix algebra.
Imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. Let’s simulate some data using the equations we know today and adding some measurement error:
set.seed(1)
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, note: we use tt because t is a base function
d <- 56.67 - 0.5*g*tt^2 + rnorm(n,sd=1) ##meters
The assistants hand the data to Galileo and this is what he sees:
mypar()
plot(tt,d,ylab="Distance in meters",xlab="Time in seconds")
Simulated data for distance travelled versus time of falling object measured with error.
He does not know the exact equation, but by looking at the plot above he deduces that the position should follow a parabola. So he models the data with:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
With \(Y_i\) representing location, \(x_i\) representing the time, and \(\varepsilon_i\) accounting for measurement error. This is a linear model because it is a linear combination of known quantities (the \(x\)’s) referred to as predictors or covariates and unknown parameters (the \(\beta\)’s).
Now imagine you are Francis Galton in the 19th century and you collect paired height data from fathers and sons. You suspect that height is inherited. Your data:
data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
looks like this:
plot(x,y,xlab="Father's height",ylab="Son's height")
Galton’s data. Son heights versus father heights.
The sons’ heights do seem to increase linearly with the fathers’ heights. In this case, a model that describes the data is as follows:
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1,\dots,N \]
This is also a linear model with \(x_i\) and \(Y_i\), the father and son heights respectively, for the \(i\)-th pair and \(\varepsilon_i\) a term to account for the extra variability. Here we think of the fathers’ heights as the predictor and being fixed (not random) so we use lower case. Measurement error alone can’t explain all the variability seen in \(\varepsilon_i\). This makes sense as there are other variables not in the model, for example, mothers’ heights, genetic randomness, and environmental factors.
Here we read in mouse body weight data from mice that were fed two different diets: high fat and control (chow). We have a random sample of 12 mice for each. We are interested in determining if the diet has an effect on weight. Here is the data:
We want to estimate the difference in average weight between populations. We demonstrated how to do this using t-tests and confidence intervals, based on the difference in sample averages. We can obtain the same exact results using a linear model:
\[ Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i\]
with \(\beta_0\) the chow diet average weight, \(\beta_1\) the difference between averages, \(x_i = 1\) when mouse \(i\) gets the high fat (hf) diet, \(x_i = 0\) when it gets the chow diet, and \(\varepsilon_i\) explains the differences between mice of the same population.
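A sketch with simulated weights (not the actual mouse data) shows that fitting this model with lm recovers exactly the two sample averages: the intercept estimate is the chow sample mean and the slope estimate is the difference in means. The means and SD below are made-up values for illustration:

```r
## Simulated stand-in for the mouse weight data
set.seed(1)
chow <- rnorm(12, mean = 24, sd = 3)  ## hypothetical chow-diet weights
hf   <- rnorm(12, mean = 27, sd = 3)  ## hypothetical high-fat-diet weights
y <- c(chow, hf)
x <- rep(c(0, 1), each = 12)          ## indicator: 1 = high fat diet
fit <- lm(y ~ x)
coef(fit)  ## (Intercept) = mean(chow); x = mean(hf) - mean(chow)
```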
We have seen three very different examples in which linear models can be used. A general model that encompasses all of the above examples is the following:
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \varepsilon_i, i=1,\dots,n \]
or, more compactly:
\[ Y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j} + \varepsilon_i, i=1,\dots,n \]
Note that we have a general number of predictors \(p\). Matrix algebra provides a compact language and mathematical framework to compute and make derivations with any linear model that fits into the above framework.
For the models above to be useful we have to estimate the unknown \(\beta\) s. In the first example, we want to describe a physical process for which we can’t have unknown parameters. In the second example, we better understand inheritance by estimating how much, on average, the father’s height affects the son’s height. In the final example, we want to determine if there is in fact a difference: if \(\beta_1 \neq 0\).
The standard approach in science is to find the values that minimize the distance of the fitted model to the data. The following is called the least squares (LS) equation and we will see it often in this chapter:
\[ \sum_{i=1}^n \left\{ Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j x_{i,j}\right)\right\}^2 \]
Once we find the minimum, we will call the values the least squares estimates (LSE) and denote them with \(\hat{\beta}\). The quantity obtained when evaluating the least squares equation at the estimates is called the residual sum of squares (RSS). Since all these quantities depend on \(Y\), they are random variables. The \(\hat{\beta}\) s are random variables and we will eventually perform inference on them.
Thanks to my high school physics teacher, I know that the equation for the trajectory of a falling object is:
\[d = h_0 + v_0 t - 0.5 \times 9.8 t^2\]
with \(h_0\) and \(v_0\) the starting height and velocity
respectively. The data we simulated above followed this equation and
added measurement error to simulate n observations for
dropping the ball \((v_0=0)\) from the
tower of Pisa \((h_0=56.67)\). This is
why we used this code to simulate data:
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
f <- 56.67 - 0.5*g*tt^2
y <- f + rnorm(n,sd=1)
Here is what the data looks like with the solid line representing the true trajectory:
plot(tt,y,ylab="Distance in meters",xlab="Time in seconds")
lines(tt,f,col=2)
Fitted model for simulated data for distance travelled versus time of falling object measured with error.
But we were pretending to be Galileo and so we don’t know the parameters in the model. The data does suggest it is a parabola, so we model it as such:
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n \]
How do we find the LSE?
In R we can fit this model by simply using the lm function. We will describe this function in detail later, but here is a preview:
tt2 <- tt^2
fit <- lm(y~tt+tt2)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.4306 0.3969 142.1654 4.227e-34
## tt 0.1468 0.5407 0.2714 7.886e-01
## tt2 -4.8944 0.1536 -31.8627 6.639e-20
It gives us the LSE, as well as standard errors and p-values.
Part of what we do in this section is to explain the mathematics behind this function.
Let’s write a function that computes the RSS for any vector \(\beta\):
rss <- function(Beta0,Beta1,Beta2){
r <- y - (Beta0+Beta1*tt+Beta2*tt^2)
return(sum(r^2))
}
So for any three dimensional vector we get an RSS. Here is a plot of the RSS as a function of \(\beta_2\) when we keep the other two fixed:
Beta2s<- seq(-10,0,len=100)
plot(Beta2s,sapply(Beta2s,rss,Beta0=55,Beta1=0),
ylab="RSS",xlab="Beta2",type="l")
##Let's add another curve fixing another pair:
Beta2s<- seq(-10,0,len=100)
lines(Beta2s,sapply(Beta2s,rss,Beta0=65,Beta1=0),col=2)
Residual sum of squares obtained for several values of the parameters.
Trial and error here is not going to work. Instead, we can use calculus: take the partial derivatives, set them to 0 and solve. Of course, if we have many parameters, these equations can get rather complex. Linear algebra provides a compact and general way of solving this problem.
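As a preview of that linear algebra solution, the least squares estimate has the closed form \(\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\), which we derive later. This sketch applies it to the simulated falling-object data and checks that it matches lm:

```r
## Regenerate the simulated falling-object data
set.seed(1)
g <- 9.8  ## meters per second squared
n <- 25
tt <- seq(0, 3.4, len = n)
y <- 56.67 - 0.5 * g * tt^2 + rnorm(n, sd = 1)

## Design matrix with an intercept column, then beta-hat = (X'X)^{-1} X'y
X <- cbind(1, tt, tt^2)
betahat <- solve(crossprod(X), crossprod(X, y))
betahat
coef(lm(y ~ tt + I(tt^2)))  ## lm gives the same estimates
```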
When studying the father-son data, Galton made a fascinating discovery using exploratory analysis.
He noted that if he tabulated the number of father-son height pairs and followed all the x,y values having the same totals in the table, they formed an ellipse. In the plot above, made by Galton, you see the ellipse formed by the pairs having 3 cases. This then led to modeling this data as correlated bivariate normal which we described earlier:
\[ Pr(X<a,Y<b) = \]
\[ \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left\{ -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right\} } \, dy \, dx \]
We described how we can use math to show that if you keep \(X\) fixed (condition to be \(x\)) the distribution of \(Y\) is normally distributed with mean \(\mu_y +\sigma_y \rho \left(\frac{x-\mu_x}{\sigma_x}\right)\) and standard deviation \(\sigma_y \sqrt{1-\rho^2}\). Note that \(\rho\) is the correlation between \(Y\) and \(X\), which implies that if we fix \(X=x\), \(Y\) does in fact follow a linear model. The \(\beta_0\) and \(\beta_1\) parameters in our simple linear model can be expressed in terms of \(\mu_x,\mu_y,\sigma_x,\sigma_y\), and \(\rho\).
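In particular, \(\beta_1 = \rho \sigma_y/\sigma_x\). This sketch checks that numerically with simulated standard bivariate normal data (not Galton's actual data), where \(\sigma_x=\sigma_y=1\) so the slope should be close to \(\rho\):

```r
## Simulate correlated standard normals with correlation rho
set.seed(1)
n <- 100000
rho <- 0.5
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
fit <- lm(y ~ x)
coef(fit)["x"]  ## slope estimate, close to rho = 0.5
```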
Linear models can be extended in many directions. Here are some examples of extensions, which you might come across in analyzing data in the life sciences:
In calculating the solution and its estimated error in the standard linear model, we minimize the squared errors. This involves a sum of squares from all the data points, which means that a few outlier data points can have a large influence on the solution. In addition, the errors are assumed to have constant variance (called homoskedasticity), which might not always hold true (when this is not true, it is called heteroskedasticity). Therefore, methods have been developed to generate more robust solutions, which behave well in the presence of outliers, or when the distributional assumptions are not met. A number of these are mentioned on the robust statistics page on the CRAN website. For more background, there is also a Wikipedia article with references.
In the standard linear model, we did not make any assumptions about the distribution of \(\mathbf{Y}\), though in some cases we can gain better estimates if we know that \(\mathbf{Y}\) is, for example, restricted to non-negative integers \(0,1,2,\dots\), or restricted to the interval \([0,1]\). A framework for analyzing such cases is referred to as generalized linear models, commonly abbreviated as GLMs. The two key components of the GLM are the link function and a probability distribution. The link function \(g\) connects our familiar matrix product \(\mathbf{X} \boldsymbol{\beta}\) to the \(\mathbf{Y}\) values through:
\[ \textrm{E}(\mathbf{Y}) = g^{-1}( \mathbf{X} \boldsymbol{\beta} ) \]
R includes the function glm which fits GLMs and uses a
familiar form as lm. Additional arguments include
family, which can be used to specify the distributional
assumption for \(\mathbf{Y}\). Some
examples of the use of GLMs are shown at the Quick R website.
There are a number of references for GLMs on the Wikipedia
page.
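As a sketch of the glm interface, here is a logistic regression (binomial family, logit link) fit to simulated binary data; the true coefficients -1 and 2 are made-up values for illustration:

```r
## Simulate binary outcomes whose success probability follows a logistic model
set.seed(1)
n <- 1000
x <- rnorm(n)
p <- 1 / (1 + exp(-(-1 + 2 * x)))  ## inverse logit of -1 + 2x
y <- rbinom(n, size = 1, prob = p)
fit <- glm(y ~ x, family = binomial)
coef(fit)  ## estimates should be near the true values -1 and 2
```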
In the standard linear model, we assumed that the matrix \(\mathbf{X}\) was fixed and not random. For example, we measured the frictional coefficients for each leg pair, and in the push and pull direction. The fact that an observation had a \(1\) for a given column in \(\mathbf{X}\) was not random, but dictated by the experimental design. However, in the father and son heights example, we did not fix the values of the fathers’ heights, but observed these (and likely these were measured with some error). A framework for studying the effect of the randomness for various columns in \(X\) is referred to as mixed effects models, which implies that some effects are fixed and some effects are random. One of the most popular packages in R for fitting linear mixed effects models is lme4 which has an accompanying paper on Fitting Linear Mixed-Effects Models using lme4. There is also a Wikipedia page with more references.
The approach presented here assumed \(\boldsymbol{\beta}\) was a fixed (non-random) parameter. We presented methodology that estimates this parameter, along with standard errors that quantify uncertainty, in the estimation process. This is referred to as the frequentist approach. An alternative approach is to assume that \(\boldsymbol{\beta}\) is random and its distribution quantifies our prior beliefs about what \(\boldsymbol{\beta}\) should be. Once we have observed data, then we update our prior beliefs by computing the conditional distribution, referred to as the posterior distribution, of \(\boldsymbol{\beta}\) given the data. This is referred to as the Bayesian approach. For example, once we have computed the posterior distribution of \(\boldsymbol{\beta}\) we can report the most likely outcome of an interval that occurs with high probability (credible interval). In addition, many models can be connected together in what is referred to as a hierarchical model. Note that we provide a brief introduction to Bayesian statistics and hierarchical models in a later chapter. A good reference for Bayesian hierarchical models is Bayesian Data Analysis, and some software for computing Bayesian linear models can be found on the Bayes page on CRAN. Some well known software for computing Bayesian models are stan and BUGS.
Note that if we include enough parameters in a model we can achieve a residual sum of squares of 0. Penalized linear models introduce a penalty term to the least squares equation we minimize. These penalties are typically of the form \(\lambda \sum_{j=1}^p |\beta_j|^k\), and they penalize for large absolute values of \(\beta\) as well as large numbers of parameters. The motivation for this extra term is to avoid over-fitting. To use these models, we need to pick \(\lambda\), which determines how much we penalize. When \(k=2\), this is referred to as ridge regression, Tikhonov regularization, or L2 regularization. When \(k=1\), this is referred to as LASSO or L1 regularization. A good reference for these penalized linear models is the Elements of Statistical Learning textbook, which is available as a free pdf. Some R packages which implement penalized linear models are the lm.ridge function in the MASS package, the lars package, and the glmnet package.
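For the \(k=2\) case, ridge regression has a closed-form solution, \(\hat{\beta} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\), which this sketch computes on simulated data (the dimensions, true coefficients, and \(\lambda\) are made-up values for illustration):

```r
## Ridge regression via its closed form: (X'X + lambda*I)^{-1} X'y
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0, 0, 1)          ## hypothetical true coefficients
y <- X %*% beta + rnorm(n)
lambda <- 1
ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
ridge  ## estimates shrunk toward 0 relative to the least squares estimates
```

Increasing lambda shrinks the estimates further toward zero; lambda = 0 recovers ordinary least squares.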
Here we give a brief introduction to the main task of machine learning: class prediction. In fact, many refer to class prediction as machine learning and we sometimes use the two terms interchangeably. We give a very brief introduction to this vast topic, focusing on some specific examples.
Some of the examples we give here are motivated by those in the excellent textbook The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman, which can be found here.
Similar to inference in the context of regression, Machine Learning (ML) studies the relationships between outcomes \(Y\) and covariates \(X\). In ML, we call \(X\) the predictors or features. The main difference between ML and inference is that, in ML, we are interested mainly in predicting \(Y\) using \(X\). Statistical models are used, but while in inference we estimate and interpret model parameters, in ML they are mainly a means to an end: predicting \(Y\).
Here we introduce the main concepts needed to understand ML, along with two specific algorithms: regression and k nearest neighbors (kNN). Keep in mind that there are dozens of popular algorithms that we do not cover here.
In a previous section, we covered the very simple one-predictor case. However, most of ML is concerned with cases with more than one predictor. For illustration purposes, we move to a case in which \(X\) is two dimensional and \(Y\) is binary. We simulate a situation with a non-linear relationship using an example from the Hastie, Tibshirani and Friedman book. In the plot below, we show the actual values of \(f(x_1,x_2)=E(Y \mid X_1=x_1,X_2=x_2)\) using colors. The following code is used to create a relatively complex conditional probability function. We create the test and train data we use later (code not shown). Here is the plot of \(f(x_1,x_2)\) with red representing values close to 1, blue representing values close to 0, and yellow values in between.
Probability of Y=1 as a function of X1 and X2. Red is close to 1, yellow close to 0.5, and blue close to 0.
If we show points for which \(E(Y \mid X=x)>0.5\) in red and the rest in blue, we see the boundary region that denotes the boundary in which we switch from predicting 0 to 1.
Bayes rule. The line divides part of the space for which probability is larger than 0.5 (red) and lower than 0.5 (blue).
The above plots relate to the “truth” that we do not get to see. Most ML methodology is concerned with estimating \(f(x)\). A typical first step is usually to consider a sample, referred to as the training set, to estimate \(f(x)\). We will review two specific ML techniques. First, we need to review the main concept we use to evaluate the performance of these methods.
In the code (not shown) for the first plot in this chapter, we created a test and a training set. We plot them here:
#x, test, cols, and colstest were created in code that was not shown
#x is training x1 and x2, test is test x1 and x2
#cols (0=blue, 1=red) are colors for the training observations
#colstest are colors for the test observations
mypar(1,2)
plot(x,pch=21,bg=cols,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
plot(test,pch=21,bg=colstest,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
Training data (left) and test data (right).
You will notice that the test and train set have similar global properties since they were generated by the same random variables (more blue towards the bottom right), but are, by construction, different. The reason we create test and training sets is to detect over-training by testing on a different data than the one used to fit models or train algorithms. We will see how important this is below.
A first naive approach to this ML problem is to fit a two variable linear regression model:
##x and y were created in the code (not shown) for the first plot
#y is outcome for the training set
X1 <- x[,1] ##these are the covariates
X2 <- x[,2]
fit1 <- lm(y~X1+X2)
Once we have the fitted values, we can estimate \(f(x_1,x_2)\) with \(\hat{f}(x_1,x_2)=\hat{\beta}_0 + \hat{\beta}_1x_1 +\hat{\beta}_2 x_2\). To provide an actual prediction, we simply predict 1 when \(\hat{f}(x_1,x_2)>0.5\). We now examine the error rates in the test and training sets and also plot the boundary region:
##prediction on train
yhat <- predict(fit1)
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in train:",1-mean(yhat==y),"\n")
## Linear regression prediction error in train: 0.2975
We can quickly obtain predicted values for any set of values using the predict function:
yhat <- predict(fit1,newdata=data.frame(X1=newx[,1],X2=newx[,2]))
Now we can create a plot showing where we predict 1s and where we predict 0s, as well as the boundary. We can also use the predict function to obtain predicted values for our test set. Note that nowhere do we fit the model on the test set:
colshat <- yhat
colshat[yhat>=0.5] <- mycols[2]
colshat[yhat<0.5] <- mycols[1]
m <- -fit1$coef[2]/fit1$coef[3] #boundary slope
b <- (0.5 - fit1$coef[1])/fit1$coef[3] #boundary intercept
##prediction on test
yhat <- predict(fit1,newdata=data.frame(X1=test[,1],X2=test[,2]))
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in test:",1-mean(yhat==ytest),"\n")
## Linear regression prediction error in test: 0.32
plot(test,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
abline(b,m)
points(newx,col=colshat,pch=16,cex=0.35)
##test was created in the code (not shown) for the first plot
points(test,bg=cols,pch=21)
We estimate the probability of 1 with a linear regression model with X1 and X2 as predictors. The resulting prediction map is divided into parts that are larger than 0.5 (red) and lower than 0.5 (blue).
The error rates in the test and train sets are quite similar. Thus, we do not seem to be over-training. This is not surprising as we are fitting a 2 parameter model to 400 data points. However, note that the boundary is a line. Because we are fitting a plane to the data, there is no other option here. The linear regression method is too rigid. The rigidity makes it stable and avoids over-training, but it also keeps the model from adapting to the non-linear relationship between \(Y\) and \(X\). We saw this before in the smoothing section. The next ML technique we consider is similar to the smoothing techniques described before.
K-nearest neighbors (kNN) is similar to bin smoothing, but it is easier to adapt to multiple dimensions. Basically, for any point \(x\) for which we want an estimate, we look for the k nearest points and then take an average of these points. This gives us an estimate of \(f(x_1,x_2)\), just like the bin smoother gave us an estimate of a curve. We can now control flexibility through \(k\). Here we compare \(k=1\) and \(k=100\).
library(class)
mypar(2,2)
for(k in c(1,100)){
##predict on train
yhat <- knn(x,x,y,k=k)
cat("KNN prediction error in train:",1-mean((as.numeric(yhat)-1)==y),"\n")
##make plot
yhat <- knn(x,test,y,k=k)
cat("KNN prediction error in test:",1-mean((as.numeric(yhat)-1)==ytest),"\n")
}
## KNN prediction error in train: 0
## KNN prediction error in test: 0.3375
## KNN prediction error in train: 0.2725
## KNN prediction error in test: 0.3125
To visualize why we make no errors in the train set and many errors in the test set when \(k=1\) and obtain more stable results from \(k=100\), we show the prediction regions (code not shown):
Prediction regions obtained with kNN for k=1 (top) and k=100 (bottom). We show both train (left) and test data (right).
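The region plots can be reproduced with a sketch along these lines; the data below are simulated stand-ins, since the book's x, y, and grid objects come from code that is not shown:

```r
library(class) # for knn()

## Stand-in training data (the book's x and y come from code not shown)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- as.numeric(x[, 1] + x[, 2] + rnorm(100, sd = 0.5) > 0)

## Evaluate knn() on a fine grid and color each grid point by its
## predicted class to reveal the prediction regions
GS <- 75
newx <- expand.grid(X1 = seq(min(x[, 1]), max(x[, 1]), len = GS),
                    X2 = seq(min(x[, 2]), max(x[, 2]), len = GS))
yhat <- knn(x, newx, y, k = 5)
plot(newx, col = ifelse(yhat == "1", "red", "blue"), pch = 16, cex = 0.35,
     xlab = "X1", ylab = "X2")
points(x, bg = ifelse(y == 1, "red", "blue"), pch = 21)
```

Repeating this with k=1 and k=100 produces panels like the figure: k=1 traces noisy islands around individual training points, while k=100 gives smooth regions.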
When \(k=1\), we make no mistakes in the training set since every point is its own closest neighbor. However, we see some islands of blue in the red area that, once we move to the test set, are more error prone. In the case \(k=100\), we do not have this problem and we also see that we improve the error rate over linear regression. We can also see that our estimate of \(f(x_1,x_2)\) is closer to the truth.
Here we include a comparison of the test and train set errors for various values of \(k\). We also include the error rate that we would make if we actually knew \(\mbox{E}(Y \mid X_1=x_1,X_2=x_2)\), referred to as Bayes Rule.
We start by computing the error rates…
###Bayes Rule
yhat <- apply(test,1,p)
cat("Bayes rule prediction error in test",1-mean(round(yhat)==ytest),"\n")
## Bayes rule prediction error in test 0.2825
bayes.error=1-mean(round(yhat)==ytest)
train.error <- rep(0,16)
test.error <- rep(0,16)
for(k in seq(along=train.error)){
##predict on train
yhat <- knn(x,x,y,k=2^(k/2))
train.error[k] <- 1-mean((as.numeric(yhat)-1)==y)
##prediction on test
yhat <- knn(x,test,y,k=2^(k/2))
test.error[k] <- 1-mean((as.numeric(yhat)-1)==ytest)
}
… and then plot the error rates against values of \(k\). We also show the Bayes rules error rate as a horizontal line.
ks <- 2^(seq(along=train.error)/2)
mypar()
plot(ks,train.error,type="n",xlab="K",ylab="Prediction Error",log="x",
ylim=range(c(test.error,train.error)))
lines(ks,train.error,type="b",col=4,lty=2,lwd=2)
lines(ks,test.error,type="b",col=5,lty=3,lwd=2)
abline(h=bayes.error,col=6)
legend("bottomright",c("Train","Test","Bayes"),col=c(4,5,6),lty=c(2,3,1),box.lwd=0)
Prediction error in train (pink) and test (green) versus number of neighbors. The yellow line represents what one obtains with Bayes Rule.
Note that these error rates are random variables and have standard errors. In the next section we describe cross-validation which helps reduce some of this variability. However, even with this variability, the plot clearly shows the problem of over-fitting when using values lower than 20 and under-fitting with values above 100.
We have shown how to find the least squares estimates with matrix algebra. These estimates are random variables since they are linear combinations of the data. For these estimates to be useful, we also need to compute their standard errors. Linear algebra provides a powerful approach for this task. We provide several examples.
It is useful to think about where randomness comes from. In our falling object example, randomness was introduced through measurement errors. Each time we rerun the experiment, a new set of measurement errors will be made. This implies that our data will change randomly, which in turn suggests that our estimates will change randomly. For instance, our estimate of the gravitational constant will change every time we perform the experiment. The constant is fixed, but our estimates are not. To see this we can run a Monte Carlo simulation. Specifically, we will generate the data repeatedly and each time compute the estimate for the quadratic term.
set.seed(1)
B <- 10000
h0 <- 56.67
v0 <- 0
g <- 9.8 ##meters per second squared
n <- 25
tt <- seq(0,3.4,len=n) ##time in secs, t is a base function
X <-cbind(1,tt,tt^2)
##create (X'X)^-1 X'
A <- solve(crossprod(X)) %*% t(X)
betahat<-replicate(B,{
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
betahats <- A%*%y
return(betahats[3])
})
head(betahat)
## [1] -5.039 -4.894 -5.144 -5.221 -5.063 -4.778
As expected, the estimate is different every time. This is because \(\hat{\beta}\) is a random variable. It therefore has a distribution:
library(rafalib)
mypar(1,2)
hist(betahat)
qqnorm(betahat)
qqline(betahat)
Distribution of estimated regression coefficients obtained from Monte Carlo simulated falling object data. The left is a histogram and on the right we have a qq-plot against normal theoretical quantiles.
Since \(\hat{\beta}\) is a linear combination of the data which we made normal in our simulation, it is also normal as seen in the qq-plot above. Also, the mean of the distribution is the true parameter \(-0.5g\), as confirmed by the Monte Carlo simulation performed above.
round(mean(betahat),1)
## [1] -4.9
But we will not observe this exact value when we estimate because the standard error of our estimate is approximately:
sd(betahat)
## [1] 0.213
Here we will show how we can compute the standard error without a Monte Carlo simulation. Since in practice we do not know exactly how the errors are generated, we can’t use the Monte Carlo approach.
In the father and son height examples, we have randomness because we have a random sample of father and son pairs. For the sake of illustration, let’s assume that this is the entire population:
data(father.son,package="UsingR")
x <- father.son$fheight
y <- father.son$sheight
n <- length(y)
Now let’s run a Monte Carlo simulation in which we take a sample size of 50 over and over again.
N <- 50
B <-1000
betahat <- replicate(B,{
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
lm(y~x)$coef
})
betahat <- t(betahat) #have estimates in two columns
By making qq-plots, we see that our estimates are approximately normal random variables:
mypar(1,2)
qqnorm(betahat[,1])
qqline(betahat[,1])
qqnorm(betahat[,2])
qqline(betahat[,2])
Distribution of estimated regression coefficients obtained from Monte Carlo simulated father-son height data. On the left is a qq-plot for the intercept estimates and on the right a qq-plot for the slope estimates, both against normal theoretical quantiles.
We also see that the correlation of our estimates is negative:
cor(betahat[,1],betahat[,2])
## [1] -0.9991
When we compute linear combinations of our estimates, we will need to know this information to correctly calculate the standard error of these linear combinations.
In the next section, we will describe the variance-covariance matrix. The covariance of two random variables is defined as follows:
mean( (betahat[,1]-mean(betahat[,1] ))* (betahat[,2]-mean(betahat[,2])))
## [1] -0.9914
The covariance is the correlation multiplied by the standard deviations of each random variable:
\[\mbox{Corr}(X,Y) = \frac{\mbox{Cov}(X,Y)}{\sigma_X \sigma_Y}\]
Other than that, this quantity does not have a useful interpretation in practice. However, as we will see, it is a very useful quantity for mathematical derivations. In the next sections, we show useful matrix algebra calculations that can be used to estimate standard errors of linear model estimates.
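The relationship between covariance and correlation can be checked numerically; this is a quick sketch with simulated data, not part of the original analysis:

```r
## Check that cov(X,Y) equals cor(X,Y) * sd(X) * sd(Y)
set.seed(1)
x <- rnorm(1000)
y <- -0.9 * x + rnorm(1000, sd = 0.5) # negatively correlated pair
all.equal(cov(x, y), cor(x, y) * sd(x) * sd(y)) # TRUE
```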
As a first step we need to define the variance-covariance matrix, \(\boldsymbol{\Sigma}\). For a vector of random variables, \(\mathbf{Y}\), we define \(\boldsymbol{\Sigma}\) as the matrix with the \(i,j\) entry:
\[ \Sigma_{i,j} \equiv \mbox{Cov}(Y_i, Y_j) \]
The covariance is equal to the variance if \(i = j\) and equal to 0 if the variables are independent. In the kinds of vectors considered up to now, for example, a vector \(\mathbf{Y}\) of individual observations \(Y_i\) sampled from a population, we have assumed independence of each observation and assumed the \(Y_i\) all have the same variance \(\sigma^2\), so the variance-covariance matrix has had only two kinds of elements:
\[ \mbox{Cov}(Y_i, Y_i) = \mbox{var}(Y_i) = \sigma^2\]
\[ \mbox{Cov}(Y_i, Y_j) = 0, \mbox{ for } i \neq j\]
which implies that \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) with \(\mathbf{I}\), the identity matrix.
Later, we will see a case, specifically the estimated coefficients of a linear model, \(\hat{\boldsymbol{\beta}}\), that has non-zero entries in the off diagonal elements of \(\boldsymbol{\Sigma}\). Furthermore, the diagonal elements will not be equal to a single value \(\sigma^2\).
A useful result provided by linear algebra is that the variance covariance-matrix of a linear combination \(\mathbf{AY}\) of \(\mathbf{Y}\) can be computed as follows:
\[ \mbox{var}(\mathbf{AY}) = \mathbf{A}\mbox{var}(\mathbf{Y}) \mathbf{A}^\top \]
For example, if \(Y_1\) and \(Y_2\) are independent both with variance \(\sigma^2\) then:
\[\mbox{var}\{Y_1+Y_2\} = \mbox{var}\left\{ \begin{pmatrix}1&1\end{pmatrix}\begin{pmatrix} Y_1\\Y_2\\ \end{pmatrix}\right\}\]
\[ =\begin{pmatrix}1&1\end{pmatrix} \sigma^2 \mathbf{I}\begin{pmatrix} 1\\1\\ \end{pmatrix}=2\sigma^2\]
as we expect. We use this result to obtain the standard errors of the LSE (least squares estimate).
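The identity can be verified with a quick Monte Carlo sketch (not part of the original text): for \(A = (1\; 1)\) and independent entries with variance \(\sigma^2 = 4\), \(\mbox{var}(AY)\) should be close to \(2\sigma^2 = 8\).

```r
set.seed(1)
sigma <- 2
B <- 100000
Y <- matrix(rnorm(2 * B, sd = sigma), nrow = 2) # each column an independent (Y1, Y2)
A <- matrix(c(1, 1), nrow = 1)
sums <- as.vector(A %*% Y)               # Y1 + Y2 for each replicate
var(sums)                                # close to 2*sigma^2 = 8
drop(A %*% (sigma^2 * diag(2)) %*% t(A)) # the formula gives exactly 8
```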
Note that \(\boldsymbol{\hat{\beta}}\) is a linear combination of \(\mathbf{Y}\): \(\mathbf{AY}\) with \(\mathbf{A}=\mathbf{(X^\top X)^{-1}X}^\top\), so we can use the equation above to derive the variance of our estimates:
\[\mbox{var}(\boldsymbol{\hat{\beta}}) = \mbox{var}( \mathbf{(X^\top X)^{-1}X^\top Y} ) = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \mbox{var}(Y) (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\mathbf{(X^\top X)^{-1} X^\top} \sigma^2 \mathbf{I} (\mathbf{(X^\top X)^{-1} X^\top})^\top = \]
\[\sigma^2 \mathbf{(X^\top X)^{-1} X^\top}\mathbf{X} \mathbf{(X^\top X)^{-1}} = \]
\[\sigma^2\mathbf{(X^\top X)^{-1}}\]
The square root of the diagonal entries of this matrix gives us the standard errors of our estimates.
To obtain an actual estimate in practice from the formulas above, we need to estimate \(\sigma^2\). Previously we estimated the standard errors from the sample. However, the sample standard deviation of \(Y\) is not \(\sigma\) because \(Y\) also includes variability introduced by the deterministic part of the model: \(\mathbf{X}\boldsymbol{\beta}\). The approach we take is to use the residuals.
We form the residuals like this:
\[ \mathbf{r}\equiv\boldsymbol{\hat{\varepsilon}} = \mathbf{Y}-\mathbf{X}\boldsymbol{\hat{\beta}}\]
Both \(\mathbf{r}\) and \(\boldsymbol{\hat{\varepsilon}}\) notations are used to denote residuals.
Then we use these to estimate \(\sigma^2\), in a way similar to what we do in the univariate case:
\[ s^2 \equiv \hat{\sigma}^2 = \frac{1}{N-p}\mathbf{r}^\top\mathbf{r} = \frac{1}{N-p}\sum_{i=1}^N r_i^2\]
Here \(N\) is the sample size and \(p\) is the number of columns in \(\mathbf{X}\) or number of parameters (including the intercept term \(\beta_0\)). The reason we divide by \(N-p\) is because mathematical theory tells us that this will give us a better (unbiased) estimate.
Let’s try this in R and see if we obtain the same values as we did with the Monte Carlo simulation above:
n <- nrow(father.son)
N <- 50
index <- sample(n,N)
sampledat <- father.son[index,]
x <- sampledat$fheight
y <- sampledat$sheight
X <- model.matrix(~x)
N <- nrow(X)
p <- ncol(X)
XtXinv <- solve(crossprod(X))
resid <- y - X %*% XtXinv %*% crossprod(X,y)
s <- sqrt( sum(resid^2)/(N-p))
ses <- sqrt(diag(XtXinv))*s
Let’s compare to what lm provides:
summary(lm(y~x))$coef[,2]
## (Intercept) x
## 9.1643 0.1359
ses
## (Intercept) x
## 9.1643 0.1359
They are identical because they are doing the same thing. Also, note that we approximate the Monte Carlo results:
apply(betahat,2,sd)
## (Intercept) x
## 8.2033 0.1211
Frequently, we want to compute the standard deviation of a linear combination of estimates such as \(\hat{\beta}_2 - \hat{\beta}_1\). This is a linear combination of \(\hat{\boldsymbol{\beta}}\):
\[\hat{\beta}_2 - \hat{\beta}_1 = \begin{pmatrix}0&-1&1&0&\dots&0\end{pmatrix} \begin{pmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots\\ \hat{\beta}_p \end{pmatrix}\]
Using the above, we know how to compute the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\).
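As a sketch (with simulated data, since no dataset is attached to this passage), the SE of the contrast \(\hat{\beta}_2 - \hat{\beta}_1\) can be computed from \(s^2 (\mathbf{X}^\top \mathbf{X})^{-1}\) and the contrast vector above:

```r
set.seed(1)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # simulated response
fit <- lm(y ~ x1 + x2)
X <- model.matrix(~ x1 + x2)
s2 <- sum(fit$residuals^2) / (n - ncol(X)) # estimate of sigma^2
Sigma <- s2 * solve(crossprod(X))          # var-cov matrix of betahat
C <- matrix(c(0, -1, 1), nrow = 1)         # contrast picking beta2 - beta1
se_contrast <- drop(sqrt(C %*% Sigma %*% t(C)))
se_contrast
```

This agrees with what one gets from combining the entries of vcov(fit) directly.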
We have shown how we can obtain standard errors for our estimates. However, as we learned in the first chapter, to perform inference we need to know the distribution of these random variables. The reason we went through the effort to compute the standard errors is because the CLT applies in linear models. If \(N\) is large enough, then the LSE will be normally distributed with mean \(\boldsymbol{\beta}\) and standard errors as described. For small samples, if the \(\varepsilon\) are normally distributed, then \(\hat{\beta}-\beta\) follows a t-distribution. We do not derive this result here, but it is extremely useful since it is how we construct p-values and confidence intervals in the context of linear models.
The standard approach to writing linear models either assumes the values in \(\mathbf{X}\) are fixed or conditions on them. Thus \(\mathbf{X} \boldsymbol{\beta}\) has no variance, as \(\mathbf{X}\) is considered fixed. This is why we write \(\mbox{var}(Y_i) = \mbox{var}(\varepsilon_i)=\sigma^2\). This can cause confusion in practice because if you, for example, compute the following:
x = father.son$fheight
beta = c(34,0.5)
var(beta[1]+beta[2]*x)
## [1] 1.884
it is nowhere near 0. This is an example in which we have to be careful in distinguishing code from math. The function var simply computes the variance of the list we feed it, while the mathematical definition of variance considers only quantities that are random variables. In the R code above, x is not fixed at all: we are letting it vary, but when we write \(\mbox{var}(Y_i) = \sigma^2\) we are imposing, mathematically, x to be fixed. Similarly, if we use R to compute the variance of \(Y\) in our object dropping example, we obtain something very different from \(\sigma^2=1\) (the known variance):
n <- length(tt)
y <- h0 + v0*tt - 0.5*g*tt^2 + rnorm(n,sd=1)
var(y)
## [1] 326.7
Again, this is because we are not fixing tt.
To illustrate how we can adjust for batch effects using statistical methods, we will create a data example in which the outcome of interest is confounded with batch, but not completely. We will also select an outcome for which we have an expectation of what genes should be differentially expressed. Namely, we make sex the outcome of interest and expect genes on the Y chromosome to be differentially expressed. Note that we may also see genes from the X chromosome as differentially expressed since some escape X inactivation.
We start by finding the genes on the Y chromosome.
Now we select samples so that sex and month of hybridization are confounded.
To illustrate the confounding we will pick some genes to show in a heatmap plot. We pick all Y chromosome genes, some genes that we see correlate with batch, and then some randomly selected genes.
So what follows is like the analysis we would do in practice. We don’t know there is a batch and we are interested in finding genes that are different between males and females. We start by computing t-statistics and p-values comparing males and females. We use histograms to notice the problem introduced by the batch.
The batch effect adjustment methods are best described with linear models, so we start by writing down the linear model for this particular case:
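A sketch of the kind of model meant here (the notation is ours, not taken from the text): for gene \(i\) and sample \(j\),

\[ Y_{ij} = \alpha_i + \beta_i X_j + \gamma_i Z_j + \varepsilon_{ij} \]

with \(X_j\) an indicator of sex for sample \(j\), \(Z_j\) an indicator of batch (month of hybridization), \(\beta_i\) the sex effect for gene \(i\), and \(\gamma_i\) the batch effect that the adjustment methods try to remove.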
Here we show how to implement ComBat.
We have measurements for \(m\) genes and \(n\) samples in a matrix \(Y_{m\times n}\). Suppose we suspect that a batch effect is responsible for most of the variability. We know that some samples fall in one batch and the rest in another, but we don’t know which sample is in which batch. Can we discover the batch? If we assume that many genes will have a different average in one batch compared to the other, then we can quantify this problem as searching for the separation that makes many of these differences in average large. To simplify and illustrate further, assume \(n/2\) samples are in one batch and \(n/2\) in the other, but we don’t know which. Can we find the separation?
Assume the gene in row \(i\) is affected by batch. Then \[ (Y_{i1}, \dots, Y_{in}) (v_1,\dots,v_n)^\top = \sum_{j=1}^n v_j Y_{ij} \] with each \(v_j\) either \(1/(n/2)\) or \(-1/(n/2)\) will give us the average difference between the two batches for gene \(i\); call it \(m_i\). Because we think the batch effect affects many genes, we want to find the vector \(v=(v_1,\dots,v_n)\) that maximizes the variance of \(m_1,\dots,m_m\).
There is actually a nice mathematical result that can help us find this vector. In fact, if we let \(v\) be any vector with standard deviation 1, then the \(v\) that maximizes the variance of \(Yv\) is called the first principal component direction, or eigenvector. The vector of “differences” \(Yv\), with one entry per gene, is the first principal component, and below we will refer to the direction as \(v_1\).
Now, suppose we think there is more unwanted variability affecting several genes. We can subtract the first principal component from \(Y_{m \times n}\), \(r_{m\times n}=Y_{m \times n} - Y_{m \times n} v_1 v_1'\), and then find the vector \(v_2\) that results in the most variable vector \(r_{m\times n} v_2\). We continue this way until we obtain \(n\) eigenvectors \(V_{n\times n} = (v_1,\dots, v_n)\).
The SVD is a very powerful mathematical result that gives us an algorithm to write a matrix in the following way:
\[ Y_{m\times n} = U_{m \times n} D_{n\times n} V'_{n\times n} \]
The columns of \(V\) are the eigenvectors defined above. The matrices \(U\) and \(V\) are orthogonal, meaning that \(U_i'U_i=1\) and \(U_i'U_j=0\), where \(U_i\) and \(U_j\) are the \(i\)th and \(j\)th columns of \(U\) (and similarly for \(V\)).
Notice that the matrix \[ Y_{m\times n} V = U_{m \times n} D_{n\times n} \] has the principal components as columns and that the sum of squares of the \(i\)th principal component is \(D_{i,i}^2\): \[ (Y_{m\times n} V)'(Y_{m\times n} V) = D U' U D = D^2_{n\times n} \]
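This identity is easy to check numerically; a quick sketch with a random matrix, not from the text:

```r
set.seed(1)
Y <- matrix(rnorm(50 * 4), nrow = 50)
s <- svd(Y)
pc <- Y %*% s$v  # principal components: the columns of UD
crossprod(pc)    # diagonal matrix with the d_i^2 on the diagonal
s$d^2
```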
Let’s consider a simple example. Suppose we have the heights of identical twin pairs in an \(m\times 2\) matrix and we are asked to transmit the data as efficiently as possible.
library(MASS)
set.seed(1)
y=mvrnorm(1000,c(0,0),3^2*matrix(c(1,.9,.9,1),2,2))
mypar(1,1)
plot(y,xlab="Twin 1 (inches away from avg)",ylab="Twin 2 (inches away from avg)")
Transmitting the two heights seems inefficient given how correlated they are. If we transmit the principal components instead, we can save transmission costs. Let’s see how:
s=svd(y)
plot(s$u[,1]*s$d[1],s$u[,2]*s$d[2],ylim=range(s$u[,1]*s$d[1]),xlab="First PC",ylab="Second PC")
Jolliffe, Ian. Principal component analysis. John Wiley & Sons, Ltd, 2005.
Dunteman, George H. Principal components analysis. No. 69. Sage, 1989.
In the previous section, we motivated dimension reduction and showed a transformation that permitted us to approximate the distance between two dimensional points with just one dimension. The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.
The main result the SVD provides is that we can write an \(m \times n\) matrix \(\mathbf{Y}\) as
\[\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\]
with \(\mathbf{U}\) an \(m\times p\) orthogonal matrix, \(\mathbf{V}\) an \(n\times p\) orthogonal matrix, and \(\mathbf{D}\) a \(p\times p\) diagonal matrix with decreasing non-negative entries, where \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{Y}\) that turns out to be very useful because the variability (sum of squares to be precise) of the rows of \(\mathbf{U}^\top \mathbf{Y}=\mathbf{DV}^\top\) is decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
In fact, this formula is much more commonly used. We can also write the transformation like this:
\[\mathbf{YV} = \mathbf{UD}\]
This transformation of \(Y\) also results in a matrix with columns of decreasing sums of squares.
Applying the SVD to the motivating example we have:
library(rafalib)
library(MASS)
n <- 100
y <- t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2)))
s <- svd(y)
We can immediately see that applying the SVD results in a transformation very similar to the one we used in the motivating example:
round(sqrt(2) * s$u , 3)
## [,1] [,2]
## [1,] -1.002 -0.998
## [2,] -0.998 1.002
The plot we showed after the rotation was showing what we call the principal components: the second plotted against the first. To obtain the principal components from the SVD, we simply need the rows of the rotation \(\mathbf{U}^\top\mathbf{Y}\), or equivalently the columns of \(\mathbf{VD}\):
PC1 = s$d[1]*s$v[,1]
PC2 = s$d[2]*s$v[,2]
plot(PC1,PC2,xlim=c(-3,3),ylim=c(-3,3))
Second PC plotted against first PC for the twins height data.
It is not immediately obvious how incredibly useful the SVD can be, so let’s consider some examples. In this example, we will greatly reduce the dimension of \(V\) and still be able to reconstruct \(Y\).
Let’s compute the SVD on the gene expression table we have been working with. We will take a subset of 100 genes so that computations are faster.
The svd command returns the three matrices (only the diagonal entries are returned for \(D\)):
First note that we can in fact reconstruct y:
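A minimal sketch of this check, with a random matrix standing in for the expression subset (which is loaded in code not shown):

```r
set.seed(1)
y <- matrix(rnorm(100 * 10), nrow = 100) # stand-in for the expression matrix
s <- svd(y)
yhat <- s$u %*% diag(s$d) %*% t(s$v)     # reconstruct: U D V'
max(abs(y - yhat))                       # practically 0
```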
If we look at the sum of squares of \(\mathbf{UD}\), we see that the last few are quite close to 0 (perhaps we have some replicated columns).
This implies that the last columns of V have a very small effect on the reconstruction of Y. To see this, consider the extreme example in which the last entry of \(V\) is 0. In this case the last column of \(V\) is not needed at all. Because of the way the SVD is created, the columns of \(V\) have less and less influence on the reconstruction of \(Y\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”. As an example, we will look at what happens if we remove the last four columns:
The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 4 fewer dimensions to transmit the information.
By looking at \(d\), we can see that, in this particular dataset, we can obtain a good approximation keeping only 94 columns. The following plots are useful for seeing how much of the variability is explained by each column:
We can also make a cumulative plot:
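Both plots can be sketched as follows, again with a stand-in matrix, since the expression data are loaded in code not shown:

```r
set.seed(1)
y <- matrix(rnorm(100 * 10), nrow = 100)
d <- svd(y)$d
prop <- d^2 / sum(d^2) # proportion of variability explained by each PC
par(mfrow = c(1, 2))
plot(prop, xlab = "PC", ylab = "Proportion of variance explained")
plot(cumsum(prop), xlab = "PC", ylab = "Cumulative proportion", ylim = c(0, 1))
```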
Although we start with 189 dimensions, we can approximate \(Y\) with just 95:
Therefore, by using only half as many dimensions, we retain most of the variability in our data:
We say that we explain 96% of the variability.
Note that we can compute this proportion from \(D\):
The entries of \(D\) therefore tell us how much each PC contributes in term of variability explained.
We will now demonstrate how to obtain a p-value in practice. We begin by loading experimental data and walking you through the steps used to form a t-statistic and compute a p-value. We can perform this task with just a few lines of code (go to the end of this section to see them). However, to understand the concepts, we will construct a t-statistic from “scratch”.
We start by reading in the data. A first important step is to identify which rows are associated with treatment and control, and to compute the difference in mean.
We are asked to report a p-value. What do we do? We learned that diff, referred to as the observed effect size, is a random variable. We also learned that under the null hypothesis, the mean of the distribution of diff is 0. What about the standard error? We also learned that the standard error of this random variable is the population standard deviation divided by the square root of the sample size:
\[ SE(\bar{X}) = \sigma / \sqrt{N}\]
We use the sample standard deviation as an estimate of the population standard deviation. In R, we simply use the sd function and the SE is:
This is the SE of the sample average, but we actually want the SE of diff. We saw how statistical theory tells us that the variance of the difference of two random variables is the sum of their variances, so we compute the variance and take the square root:
Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.
This ratio is what we call the t-statistic. It’s the ratio of two random variables and thus a random variable. Once we know the distribution of this random variable, we can then easily compute a p-value.
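The steps above can be sketched end to end; the weights here are simulated stand-ins, since the actual treatment and control vectors are read in code not shown:

```r
set.seed(1)
treatment <- rnorm(12, mean = 25, sd = 3) # stand-in for the real weights
control   <- rnorm(12, mean = 24, sd = 3)
diff <- mean(treatment) - mean(control)   # observed effect size
se <- sqrt(var(treatment) / 12 + var(control) / 12) # SE of the difference
tstat <- diff / se
tstat
```

This ratio matches what t.test() later reports as its t statistic.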
As explained in the previous section, the CLT tells us that for large sample sizes, both sample averages mean(treatment) and mean(control) are normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so the CLT tells us that tstat is approximately normal with mean 0 (the null hypothesis) and SD 1 (we divided by its SE).
So now to calculate a p-value all we need to do is ask: how often is a standard normal random variable as extreme as tstat? R has a built-in function, pnorm, to answer this specific question. pnorm(a) returns the probability that a random variable following the standard normal distribution falls below a. To obtain the probability that it is larger than a, we simply use 1-pnorm(a). We want to know the probability of seeing something as extreme as tstat: either smaller (more negative) than -abs(tstat) or larger than abs(tstat). We call these two regions “tails” and calculate their size:
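The tail calculation can be sketched like this, with the same kind of simulated stand-in data (the actual vectors come from code not shown):

```r
set.seed(1)
treatment <- rnorm(12, mean = 25, sd = 3) # stand-in data
control   <- rnorm(12, mean = 24, sd = 3)
tstat <- (mean(treatment) - mean(control)) /
  sqrt(var(treatment) / 12 + var(control) / 12)
righttail <- 1 - pnorm(abs(tstat)) # upper tail
lefttail  <- pnorm(-abs(tstat))    # lower tail
pval <- lefttail + righttail       # two-sided p-value
pval
```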
In this case, the p-value is smaller than 0.05 and using the conventional cutoff of 0.05, we would call the difference statistically significant.
Now there is a problem. CLT works for large samples, but is 12 large enough? A rule of thumb for CLT is that 30 is a large enough sample size (but this is just a rule of thumb). The p-value we computed is only a valid approximation if the assumptions hold, which do not seem to be the case here. However, there is another option other than using CLT.
As described earlier, statistical theory offers another useful result. If the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT. This is a big “if” given that, with small samples, it is hard to check if the population is normal. But for something like weight, we suspect that the population distribution is likely well approximated by normal and that we can use this approximation. Furthermore, we can look at a qq-plot for the samples. This shows that the approximation is at least close:
library(rafalib)
mypar(1,2)
qqnorm(treatment)
qqline(treatment,col=2)
qqnorm(control)
qqline(control,col=2)
Quantile-quantile plots for sample against theoretical normal distribution.
If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution. This is a much more complicated distribution than the normal. The t-distribution has a location parameter like the normal and another parameter called degrees of freedom. R has a nice function that actually computes everything for us.
t.test(treatment, control)
##
## Welch Two Sample t-test
##
## data: treatment and control
## t = 0.46, df = 19, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.796 4.384
## sample estimates:
## mean of x mean of y
## 24.80 24.01
To see just the p-value, we can use the $ extractor:
result <- t.test(treatment,control)
result$p.value
## [1] 0.6489
The p-value is slightly bigger now. This is to be expected because our CLT approximation considered the denominator of tstat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.
It may be confusing that one approximation gave us one p-value and another gave us another, because we expect there to be just one answer. However, this is not uncommon in data analysis. We used different assumptions, different approximations, and therefore we obtained different results.
Later, in the power calculation section, we will describe type I and type II errors. As a preview, we will point out that the test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (a false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (false negative).
Now that we have gone over the concepts, we can show the relatively simple code that one would use to actually compute a t-test:
The arguments to t.test can be of type data.frame and thus we do not need to unlist them into numeric objects.
All models are wrong, but some are useful. -George E. P. Box
When we see a p-value in the literature, it means a probability distribution of some sort was used to quantify the null hypothesis. Many times deciding which probability distribution to use is relatively straightforward. For example, in the tea tasting challenge we can use simple probability calculations to determine the null distribution. Most p-values in the scientific literature are based on sample averages or least squares estimates from a linear model and make use of the CLT to approximate the null distribution of their statistic as normal.
The CLT is backed by theoretical results that guarantee that the approximation is accurate. However, we cannot always use this approximation, such as when our sample size is too small. Previously, we described how the sample average can be approximated as t-distributed when the population data is approximately normal. However, there is no theoretical backing for this assumption. We are now modeling. In the case of height, we know from experience that this turns out to be a very good model.
But this does not imply that every dataset we collect will follow a normal distribution. Some examples are: coin tosses, the number of people who win the lottery, and US incomes. The normal distribution is not the only parametric distribution that is available for modeling. Here we provide a very brief introduction to some of the most widely used parametric distributions and some of their uses in the life sciences. We focus on the models and concepts needed to understand the techniques currently used to perform statistical inference on high-throughput data. To do this we also need to introduce the basics of Bayesian statistics. For a more in-depth description of probability models and parametric distributions, please consult a statistics textbook such as this one.
The first distribution we will describe is the binomial distribution. It reports the probability of observing \(S=k\) successes in \(N\) trials as
\[ \mbox{Pr}(S=k) = {N \choose k}p^k (1-p)^{N-k} \]
with \(p\) the probability of success. The best known example is coin tosses with \(S\) the number of heads when tossing \(N\) coins. In this example \(p=0.5\).
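We can check this formula directly against R's built-in dbinom function:

```r
N <- 10
p <- 0.5
k <- 0:N
## the binomial probability mass function written out by hand
manual <- choose(N, k) * p^k * (1 - p)^(N - k)
max(abs(manual - dbinom(k, N, p)))  ## essentially 0
dbinom(5, N, p)  ## Pr(exactly 5 heads in 10 tosses), about 0.246
```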
Note that \(S/N\) is the average of independent random variables and thus the CLT tells us that \(S\) is approximately normal when \(N\) is large. This distribution has many applications in the life sciences. Recently, it has been used by variant callers and genotypers applied to next-generation sequencing data. A special case of this distribution is approximated by the Poisson distribution, which we describe next.
Since it is the sum of binary outcomes, the number of people that win the lottery follows a binomial distribution (we assume each person buys one ticket). The number of trials \(N\) is the number of people that buy tickets and is usually very large. However, the number of people that win the lottery oscillates between 0 and 3, which implies the normal approximation does not hold. So why does the CLT not hold? One can explain this mathematically, but the intuition is that, with the sum of successes so close to 0 and also constrained to be non-negative, it is impossible for the distribution to be normal. Here is a quick simulation:
p=10^-7 ##1 in 10,000,000 chances of winning
N=5*10^6 ##5,000,000 tickets bought
winners=rbinom(1000,N,p) ##1000 is the number of different lotto draws
tab=table(winners)
plot(tab)
Number of people that win the lottery obtained from Monte Carlo simulation.
prop.table(tab)
## winners
## 0 1 2 3
## 0.601 0.314 0.074 0.011
For cases like this, where \(N\) is very large, but \(p\) is small enough to make \(N \times p\) (call it \(\lambda\)) a number between 0 and, for example, 10, then \(S\) can be shown to follow a Poisson distribution, which has a simple parametric form:
\[ \mbox{Pr}(S=k)=\frac{\lambda^k \exp(-\lambda)}{k!} \]
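We can see how good this approximation is in the lottery example by comparing the exact binomial probabilities to the Poisson probabilities with \(\lambda = Np = 0.5\):

```r
p <- 10^-7
N <- 5 * 10^6
lambda <- N * p  ## 0.5
## exact binomial versus Poisson approximation for 0, 1, 2, 3 winners
rbind(binomial = dbinom(0:3, N, p),
      poisson  = dpois(0:3, lambda))
```

The two rows agree to several decimal places, which is why the Poisson is routinely substituted for the binomial in this regime.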
The Poisson distribution is commonly used in RNA-seq analyses. Because we are sampling thousands of molecules and most genes represent a very small proportion of the totality of molecules, the Poisson distribution seems appropriate.
So how does this help us? One way is that it provides insight about the statistical properties of summaries that are widely used in practice. For example, let’s say we only have one sample from each of a case and control RNA-seq experiment and we want to report the genes with the largest fold-changes. One insight that the Poisson model provides is that under the null hypothesis of no true differences, the statistical variability of this quantity depends on the total abundance of the gene. We can show this mathematically, but here is a quick simulation to demonstrate the point:
N=10000##number of genes
lambdas=2^seq(1,16,len=N) ##these are the true abundances of genes
y=rpois(N,lambdas)##note that the null hypothesis is true for all genes
x=rpois(N,lambdas)
ind=which(y>0 & x>0)##make sure no 0s due to ratio and log
library(rafalib)
splot(log2(lambdas),log2(y/x),subset=ind)
MA plot of simulated RNA-seq data. Replicated measurements follow a Poisson distribution.
For lower values of lambda there is much more
variability and, if we were to report anything with a fold change of 2
or more, the number of false positives would be quite high for low
abundance genes.
In this section we will use the data stored in this dataset:
library(parathyroidSE) ##available from Bioconductor
## Warning: package 'MatrixGenerics' was built under R version 4.3.1
data(parathyroidGenesSE)
se <- parathyroidGenesSE
The data is contained in a SummarizedExperiment object,
which we do not describe here. The important thing to know is that it
includes a matrix of data, where each row is a genomic feature and each
column is a sample. We can extract this data using the
assay function. For this dataset, the value of a single
cell in the data matrix is the count of reads which align to a given
gene for a given sample. Thus, a similar plot to the one we simulated
above with technical replicates reveals that the behavior predicted by
the model is present in experimental data:
x <- assay(se)[,23]
y <- assay(se)[,24]
ind=which(y>0 & x>0)##make sure no 0s due to ratio and log
splot((log2(x)+log2(y))/2,log2(x/y),subset=ind)
MA plot of replicated RNA-seq data.
If we compute the standard deviations across four individuals, they are quite a bit higher than what is predicted by a Poisson model. Assuming most genes are not differentially expressed across individuals, then, if the Poisson model is appropriate, there should be a linear relationship in this plot:
library(rafalib)
library(matrixStats)
vars=rowVars(assay(se)[,c(2,8,16,21)]) ##we know these are four
means=rowMeans(assay(se)[,c(2,8,16,21)]) ##different individuals
splot(means,vars,log="xy",subset=which(means>0&vars>0)) ##plot a subset of data
abline(0,1,col=2,lwd=2)
Variance versus mean plot. Summaries were obtained from the RNA-seq data.
The reason for this is that the variability plotted here includes biological variability, which the Poisson model does not account for. The negative binomial distribution, which combines the sampling variability of a Poisson and biological variability, is a more appropriate distribution to model this type of experiment. The negative binomial has two parameters and permits more flexibility for count data. For more on the use of the negative binomial to model RNA-seq data you can read this paper. The Poisson is a special case of the negative binomial distribution.
To illustrate the concept of maximum likelihood estimates (MLE), we use a relatively simple dataset containing palindrome locations in the HCMV genome. We read in the locations of the palindromes and then count the number of palindromes in each 4,000 basepair segment.
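The code that reads the palindrome locations is not shown in this excerpt. As a sketch, we simulate positions uniformly along the genome and tabulate them into 4,000 basepair bins; the genome length and palindrome count used below are the figures commonly cited for this dataset and should be treated as illustrative:

```r
set.seed(1)
genome.size <- 229354  ## approximate HCMV genome length in basepairs
n.palindromes <- 296   ## number of palindromes in the published dataset
## simulated stand-in for the real palindrome locations
locations <- sort(sample(genome.size, n.palindromes))
breaks <- seq(0, 228000, 4000)  ## 57 segments of 4,000 basepairs
counts <- as.numeric(table(cut(locations, breaks)))
mean(counts)  ## crude estimate of the Poisson rate lambda
```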
The counts do appear to follow a Poisson distribution. But what is the rate \(\lambda\) ? The most common approach to estimating this rate is maximum likelihood estimation. To find the maximum likelihood estimate (MLE), we note that these data are independent and the probability of observing the values we observed is:
\[ \Pr(X_1=k_1,\dots,X_n=k_n;\lambda) = \prod_{i=1}^n \frac{\lambda^{k_i}}{k_i!} \exp(-\lambda) \]
The MLE is the value of \(\lambda\) that maximizes the likelihood:
\[ \mbox{L}(\lambda; X_1=k_1,\dots,X_n=k_n)=\exp\left\{\sum_{i=1}^n \log \Pr(X_i=k_i;\lambda)\right\} \]
In practice, it is more convenient to maximize the log-likelihood
which is the summation that is exponentiated in the expression above.
Below we write code that computes the log-likelihood for any \(\lambda\) and use the function
optimize to find the value that maximizes this function
(the MLE). We show a plot of the log-likelihood along with vertical line
showing the MLE.
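A sketch of that computation, using simulated Poisson counts in place of the observed palindrome counts (which do not appear in this excerpt):

```r
set.seed(2)
counts <- rpois(57, 5.16)  ## simulated stand-in for the observed counts
## log-likelihood of the Poisson rate lambda given the counts
loglik <- function(lambda) sum(dpois(counts, lambda, log = TRUE))
lambdas <- seq(2, 10, len = 300)
plot(lambdas, sapply(lambdas, loglik), type = "l",
     xlab = "lambda", ylab = "log-likelihood")
mle <- optimize(loglik, interval = c(0.5, 50), maximum = TRUE)$maximum
abline(v = mle, lty = 2)
mle
```

optimize numerically recovers the sample average, which calculus shows is the exact MLE for the Poisson rate.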
If you work out the math and do a bit of calculus, you realize that this is a particularly simple example for which the MLE is the average.
Note that a plot of observed counts versus counts predicted by the Poisson shows that the fit is quite good in this case:
We therefore can model the palindrome count data with a Poisson with \(\lambda=5.16\).
Different genes vary differently across biological replicates. Later, in the hierarchical models chapter, we will describe one of the most influential statistical methods in the analysis of genomics data. This method provides great improvements over naive approaches to detecting differentially expressed genes. This is achieved by modeling the distribution of the gene variances. Here we describe the parametric model used in this method.
We want to model the distribution of the gene-specific standard errors. Are they normal? Keep in mind that we are modeling the population standard errors, so the CLT does not apply, even though we have thousands of genes.
As an example, we use an experimental dataset that includes both technical and biological replicates for gene expression measurements on mice. We can load the data and compute the gene-specific sample standard error for both the technical replicates and the biological replicates.
An important observation here is that the biological variability is substantially higher than the technical variability. This provides strong evidence that genes do in fact have gene-specific biological variability.
If we want to model this variability, we first notice that the normal distribution is not appropriate here since the right tail is rather large. Also, because SDs are strictly positive, there is a limitation to how symmetric this distribution can be. A qq-plot shows this very clearly:
There are parametric distributions that possess these properties (strictly positive and heavy right tails). Two examples are the gamma and F distributions. The density of the gamma distribution is defined by:
\[ f(x;\alpha,\beta)=\frac{\beta^\alpha x^{\alpha-1}\exp(-\beta x)}{\Gamma(\alpha)} \]
It is defined by two parameters \(\alpha\) and \(\beta\) that can, indirectly, control location and scale. They also control the shape of the distribution. For more on this distribution please refer to this book.
Two special cases of the gamma distribution are the chi-squared and exponential distributions. We used the chi-squared earlier to analyze 2x2 table data. For the chi-squared, we have \(\alpha=\nu/2\) and \(\beta=1/2\) (the rate in the density above), with \(\nu\) the degrees of freedom. For the exponential, we have \(\alpha=1\) and \(\beta=\lambda\), the rate.
The F-distribution comes up in analysis of variance (ANOVA). It is also always positive and has large right tails. Two parameters control its shape:
\[ f(x,d_1,d_2)=\frac{1}{B\left( \frac{d_1}{2},\frac{d_2}{2}\right)} \left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1}\left(1+\frac{d_1}{d_2}x\right)^{-\frac{d_1+d_2}{2}} \]
with \(B\) the beta function and \(d_1\) and \(d_2\) are called the degrees of freedom for reasons having to do with how it arises in ANOVA. A third parameter is sometimes used with the F-distribution, which is a scale parameter.
In a later section we will learn about a hierarchical model approach to improve estimates of variance. In these cases it is mathematically convenient to model the distribution of the variance \(\sigma^2\). The hierarchical model used here implies that the sample variances of genes follow a scaled F-distribution:
\[ s^2 \sim s_0^2 F_{d,d_0} \]
with \(d\) the degrees of freedom involved in computing \(s^2\). For example, in a case comparing 3 versus 3, the degrees of freedom would be 4. This leaves two free parameters to adjust to the data. Here \(d_0\) will control the location and \(s_0\) will control the scale. Below are some examples of \(F\) distributions plotted on top of the histogram from the sample variances:
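The histogram of sample variances is not reproduced here, but we can sketch the implied density of \(s^2\): if \(s^2 = s_0^2 X\) with \(X \sim F_{d,d_0}\), a change of variables gives density \(f(x/s_0^2)/s_0^2\), with \(f\) the \(F_{d,d_0}\) density. The parameter values below are illustrative:

```r
## density of s^2 under s^2 ~ s0^2 * F(d, d0), by change of variables
ds2 <- function(x, s0sq, d, d0) df(x / s0sq, df1 = d, df2 = d0) / s0sq
x <- seq(0.001, 1, len = 500)
plot(x, ds2(x, s0sq = 0.05, d = 4, d0 = 10), type = "l",
     xlab = "sample variance", ylab = "density")
lines(x, ds2(x, s0sq = 0.05, d = 4, d0 = 50), lty = 2)  ## larger d0: tighter
```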
Now which \(s_0\) and \(d_0\) fit our data best? This is a rather advanced topic as the MLE does not perform well for this particular distribution (we refer to Smyth (2004)). The Bioconductor limma package provides a function to estimate these parameters:
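The function is fitFDist, which (following Smyth 2004) estimates \(s_0^2\) and \(d_0\) by matching moments of \(\log s^2\). Since the limma call itself is not shown in this excerpt, here is a base-R sketch of the same moment-matching idea on simulated variances; it illustrates the approach and is not limma's exact implementation:

```r
set.seed(3)
d <- 4  ## known degrees of freedom of each sample variance
d0.true <- 20
s0sq.true <- 0.05
s2 <- s0sq.true * rf(5000, d, d0.true)  ## simulated gene variances
z <- log(s2)
## Var[log F(d, d0)] = trigamma(d/2) + trigamma(d0/2)
v <- var(z) - trigamma(d / 2)
d0.hat <- 2 * uniroot(function(x) trigamma(x) - v, c(0.01, 500))$root
## E[log F(d, d0)] = digamma(d/2) - log(d/2) - digamma(d0/2) + log(d0/2)
s0sq.hat <- exp(mean(z) - digamma(d / 2) + log(d / 2) +
                  digamma(d0.hat / 2) - log(d0.hat / 2))
c(d0.hat = d0.hat, s0sq.hat = s0sq.hat)  ## close to 20 and 0.05
```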
The fitted models do appear to provide a reasonable approximation, as demonstrated by the qq-plot and histogram:
RNA-seq is a valuable experiment for quantifying both the types and the amount of RNA molecules in a sample. We’ve covered the basic idea of the protocol in lectures, but some early references for RNA-seq include Mortazavi (2008) and Marioni (2008).
In this lab, we will focus on comparing the expression levels of genes across different samples, by counting the number of reads which overlap the exons of genes defined by a known annotation. As described in the lecture, this analysis sets aside the task of estimating the different kinds of RNA molecules, and the different isoforms for genes with multiple isoforms. One advantage of looking at these matrices of counts is that we can use statistical distributions to model how the variance of counts will change when the counts are low vs high. We will explore the relationship of the variance of counts to the mean later in this lab.
In this lab we will examine 8 samples from the airway package, which are from the paper by Himes et al: “RNA-seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells”.
This lab will focus on a summarized version of an RNA-seq experiment: a count matrix, which has genes along the rows and samples along the columns. The values in the matrix are the number of reads which could be uniquely aligned to the exons of a given gene for a given sample. We will demonstrate how to build a count matrix for a subset of reads from an experiment, and then use a pre-made count matrix, to avoid having students download the multi-gigabyte BAM files containing the aligned reads. A new pipeline for building count matrices, which skips the alignment step, is to use fast pseudoaligners such as Sailfish, Salmon and kallisto, followed by the tximport package. See the package vignette for more details. Here, we will continue with the read counting pipeline.
First, make variables for the different BAM files and GTF file. Use
the sample.table to construct the BAM file vector, so that
the count matrix will be in the same order as the
sample.table.
library(airway)
dir <- system.file("extdata", package="airway", mustWork=TRUE)
csv.file <- file.path(dir, "sample_table.csv")
sample.table <- read.csv(csv.file, row.names=1)
bam.files <- file.path(dir, paste0(sample.table$Run, "_subset.bam"))
gtf.file <- file.path(dir, "Homo_sapiens.GRCh37.75_subset.gtf")
Next we create an Rsamtools variable which wraps our BAM
files, and create a transcript database from the GTF file. We can ignore
the warning about matchCircularity. Finally, we make a
GRangesList which contains the exons for each gene.
library(Rsamtools)
bam.list <- BamFileList(bam.files)
library(GenomicFeatures)
## Warning: package 'GenomicFeatures' was built under R version 4.3.1
## Loading required package: AnnotationDbi
## Warning: package 'AnnotationDbi' was built under R version 4.3.1
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'GenomicFeatures'
## The following objects are masked from 'package:cummeRbund':
##
## features, genes
# for Bioc 3.0 use the commented out line
# txdb <- makeTranscriptDbFromGFF(gtf.file, format="gtf")
txdb <- makeTxDbFromGFF(gtf.file, format="gtf")
## Import genomic features from the file as a GRanges object ...
## OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
## stop_codon. This information was ignored.
## OK
exons.by.gene <- exonsBy(txdb, by="gene")
The following code chunk creates a SummarizedExperiment
containing the counts for the reads in each BAM file (columns) for each
gene in exons.by.gene (the rows). We add the
sample.table as column data. Remember, we know the order is
correct, because the bam.list was constructed from a column
of sample.table.
library(GenomicAlignments)
##
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
##
## last
se <- summarizeOverlaps(exons.by.gene, bam.list,
mode="Union",
singleEnd=FALSE,
ignore.strand=TRUE,
fragments=TRUE)
colData(se) <- DataFrame(sample.table)
A similar function in the Rsubread library can be used to construct a count matrix:
# not available for Windows - use above method instead
library(Rsubread)
fc <- featureCounts(bam.files, annot.ext=gtf.file,
isGTFAnnotationFile=TRUE,
isPaired=TRUE)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.14.2
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 8 BAM files ||
## || ||
## || SRR1039508_subset.bam ||
## || SRR1039509_subset.bam ||
## || SRR1039512_subset.bam ||
## || SRR1039513_subset.bam ||
## || SRR1039516_subset.bam ||
## || SRR1039517_subset.bam ||
## || SRR1039520_subset.bam ||
## || SRR1039521_subset.bam ||
## || ||
## || Paired-end : yes ||
## || Count read pairs : yes ||
## || Annotation : Homo_sapiens.GRCh37.75_subset.gtf (GTF) ||
## || Dir for temp files : . ||
## || Threads : 1 ||
## || Level : meta-feature level ||
## || Multimapping reads : counted ||
## || Multi-overlapping reads : not counted ||
## || Min overlapping bases : 1 ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file Homo_sapiens.GRCh37.75_subset.gtf ... ||
## || Features : 406 ||
## || Meta-features : 20 ||
## || Chromosomes/contigs : 1 ||
## || ||
## || Process BAM file SRR1039508_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 7142 ||
## || Successfully assigned alignments : 6734 (94.3%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039509_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 7200 ||
## || Successfully assigned alignments : 6773 (94.1%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039512_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 8536 ||
## || Successfully assigned alignments : 8008 (93.8%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039513_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 7544 ||
## || Successfully assigned alignments : 7101 (94.1%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039516_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 8818 ||
## || Successfully assigned alignments : 8342 (94.6%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039517_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 11850 ||
## || Successfully assigned alignments : 11259 (95.0%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039520_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 5877 ||
## || Successfully assigned alignments : 5469 (93.1%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file SRR1039521_subset.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 10208 ||
## || Successfully assigned alignments : 9594 (94.0%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Write the final count table. ||
## || Write the read assignment summary. ||
## || ||
## \\============================================================================//
names(fc)
## [1] "counts" "annotation" "targets" "stat"
unname(fc$counts) # hide the colnames
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0 0 4 1 0 0 1 0
## [2,] 0 0 1 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0
## [5,] 2752 2080 3353 1617 3521 3716 2220 1991
## [6,] 38 28 66 24 42 41 47 36
## [7,] 58 55 55 42 53 54 50 43
## [8,] 1004 1255 1122 1313 1100 1879 745 1536
## [9,] 1062 1293 1360 1286 1201 1725 919 1940
## [10,] 2 1 2 1 2 7 1 4
## [11,] 2 0 6 7 1 7 1 6
## [12,] 1592 1753 1786 2013 2150 3355 1264 2571
## [13,] 1 0 0 1 0 0 0 0
## [14,] 1 0 0 0 0 0 0 0
## [15,] 0 0 0 0 0 0 0 0
## [16,] 4 50 19 543 1 10 14 1067
## [17,] 0 0 0 1 0 0 0 0
## [18,] 0 0 0 0 0 0 0 0
## [19,] 218 257 234 252 269 465 207 400
## [20,] 0 1 0 0 2 0 0 0
Plot the first column from each function against each other (after matching the rows of the featureCounts matrix to the one returned by summarizeOverlaps).
plot(assay(se)[,1],
fc$counts[match(rownames(se),rownames(fc$counts)),1])
abline(0,1)
We now load the full SummarizedExperiment object, counting reads over all the genes.
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(airway)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
rowRanges(airway)
## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99884983 - | 667145 ENSE00001459322
## [2] X 99885756-99885863 - | 667146 ENSE00000868868
## [3] X 99887482-99887565 - | 667147 ENSE00000401072
## [4] X 99887538-99887565 - | 667148 ENSE00001849132
## [5] X 99888402-99888536 - | 667149 ENSE00003554016
## ... ... ... ... . ... ...
## [13] X 99890555-99890743 - | 667156 ENSE00003512331
## [14] X 99891188-99891686 - | 667158 ENSE00001886883
## [15] X 99891605-99891803 - | 667159 ENSE00001855382
## [16] X 99891790-99892101 - | 667160 ENSE00001863395
## [17] X 99894942-99894988 - | 667161 ENSE00001828996
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
##
## ...
## <63676 more elements>
The counts matrix is stored in assay of a
SummarizedExperiment.
head(assay(airway))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
This code chunk is not necessary, but helps to make nicer plots below
with large axis labels (mypar(1,2) can be substituted with
par(mfrow=c(1,2)) below).
# library(devtools)
# install_github("ririzarr/rafalib")
library(rafalib)
mypar()
Note that, on the un-transformed scale, the high count genes have high variance. That is, in the following scatter plot, the points start out in a tight cone and then fan out toward the top right. This is a general property of counts generated from sampling processes: the variance typically increases with the expected value. We will explore different scaling and transformation options below.
plot(assay(airway)[,1:2], cex=.1)
We will use the DESeq2 package to normalize the samples
for sequencing depth. The DESeqDataSet object is just an
extension of the SummarizedExperiment object, with a few
changes. The matrix in assay is now accessed with
counts and the elements of this matrix are required to be
non-negative integers (0,1,2,…).
We need to specify an experimental design here, for later
use in differential analysis. The design starts with the tilde symbol
~, which means, model the counts (log2 scale) using the
following formula. Following the tilde, the variables are columns of the
colData, and the + indicates that for
differential expression analysis we want to compare levels of
dex while controlling for the cell
differences.
library(DESeq2)
## Warning: package 'DESeq2' was built under R version 4.3.1
##
## Attaching package: 'DESeq2'
## The following object is masked from 'package:cummeRbund':
##
## fpkm
dds <- DESeqDataSet(airway, design= ~ cell + dex)
We can also make a DESeqDataSet from a count matrix and column data.
dds.fc <- DESeqDataSetFromMatrix(assay(se), # or fc$counts from Rsubread
colData=sample.table,
design=~ cell + dex)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
The following estimates size factors to account for differences in
sequencing depth, and is only necessary to make the
log.norm.counts object below.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## 1.0236 0.8962 1.1795 0.6701 1.1777 1.3990 0.9208
## SRR1039521
## 0.9445
colSums(counts(dds))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## 20637971 18809481 25348649 15163415 24448408 30818215 19126151
## SRR1039521
## 21164133
plot(sizeFactors(dds), colSums(counts(dds)))
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))
Size factors are calculated by the median ratio of samples to a pseudo-sample (the geometric mean of all samples). In other words, for each sample, we exponentiate the median of the log ratios in this histogram.
loggeomeans <- rowMeans(log(counts(dds)))
hist(log(counts(dds)[,1]) - loggeomeans,
col="grey", main="", xlab="", breaks=40)
The size factor for the first sample:
exp(median((log(counts(dds)[,1]) - loggeomeans)[is.finite(loggeomeans)]))
## [1] 1.024
sizeFactors(dds)[1]
## SRR1039508
## 1.024
Make a matrix of log normalized counts (plus a pseudocount):
log.norm.counts <- log2(counts(dds, normalized=TRUE) + 1)
Another way to make this matrix, and keep the sample and gene
information is to use the function normTransform. The same
matrix as above is stored in assay(log.norm).
log.norm <- normTransform(dds)
Examine the log counts and the log normalized counts (plus a pseudocount).
rs <- rowSums(counts(dds))
mypar(1,2)
boxplot(log2(counts(dds)[rs > 0,] + 1)) # not normalized
boxplot(log.norm.counts[rs > 0,]) # normalized
Make a scatterplot of log normalized counts against each other. Note the fanning out of the points in the lower left corner, for points less than \(2^5 = 32\).
plot(log.norm.counts[,1:2], cex=.1)
Now we will use a more sophisticated transformation, which is similar
to the variance stabilizing normalization method taught in Week 3 of
Course 4: Introduction to Bioconductor. It uses the variance model for
count data to shrink together the log-transformed counts for genes with
very low counts. For genes with medium and high counts, the
rlog is very close to log2. For further
details, see the section in the DESeq2 paper.
rld <- rlog(dds)
plot(assay(rld)[,1:2], cex=.1)
Another transformation for stabilizing variance in the
DESeq2 package is
varianceStabilizingTransformation. These two transformations
are similar; the rlog might perform a bit better when the size
factors vary widely, and the varianceStabilizingTransformation
is much faster when there are many samples.
vsd <- varianceStabilizingTransformation(dds)
plot(assay(vsd)[,1:2], cex=.1)
We can examine the standard deviation of rows over the mean for the log plus pseudocount and the rlog. Note that the genes with high variance for the log come from the genes with lowest mean. If these genes were included in a distance calculation, the high variance at the low count range might overwhelm the signal at the higher count range.
library(vsn)
meanSdPlot(log.norm.counts, ranks=FALSE)
For the rlog:
meanSdPlot(assay(rld), ranks=FALSE)
For the VST:
meanSdPlot(assay(vsd), ranks=FALSE)
The principal components (PCA) plot is a useful diagnostic for examining relationships between samples:
plotPCA(log.norm, intgroup="dex")
Using the rlog:
plotPCA(rld, intgroup="dex")
Using the VST:
plotPCA(vsd, intgroup="dex")
We can make this plot even nicer using custom code from the ggplot2 library:
library(ggplot2)
(data <- plotPCA(rld, intgroup=c("dex","cell"), returnData=TRUE))
## PC1 PC2 group dex cell name
## SRR1039508 -14.327 -4.205 untrt:N61311 untrt N61311 SRR1039508
## SRR1039509 6.752 -2.242 trt:N61311 trt N61311 SRR1039509
## SRR1039512 -8.129 -3.950 untrt:N052611 untrt N052611 SRR1039512
## SRR1039513 14.501 -2.940 trt:N052611 trt N052611 SRR1039513
## SRR1039516 -11.887 13.728 untrt:N080611 untrt N080611 SRR1039516
## SRR1039517 8.372 17.816 trt:N080611 trt N080611 SRR1039517
## SRR1039520 -9.963 -10.012 untrt:N061011 untrt N061011 SRR1039520
## SRR1039521 14.681 -8.194 trt:N061011 trt N061011 SRR1039521
(percentVar <- 100*round(attr(data, "percentVar"),2))
## [1] 39 27
makeLab <- function(x,pc) paste0("PC",pc,": ",x,"% variance")
ggplot(data, aes(PC1,PC2,col=dex,shape=cell)) + geom_point() +
xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2))
In addition, we can plot a hierarchical clustering based on the Euclidean distance matrix:
mypar(1,2)
plot(hclust(dist(t(log.norm.counts))), labels=colData(dds)$dex)
plot(hclust(dist(t(assay(rld)))), labels=colData(rld)$dex)
We will now perform differential gene expression on the counts, to try to find genes for which the differences in expected counts across samples due to the condition of interest rise above the biological and technical variance we observe.
We will use an overdispersed Poisson distribution – called the negative binomial – to model the raw counts in the count matrix. The model will take the size factors into account to adjust for sequencing depth. The formula will look like:
\[ K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i ) \]
where \(K_{ij}\) is a single raw count in our count table, \(s_{ij}\) is a size factor or more generally a normalization factor, \(q_{ij}\) is proportional to gene expression (what we want to model with our design variables), and \(\alpha_i\) is a dispersion parameter.
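A quick simulation confirms the mean-variance relationship this model implies, \(\mbox{Var}(K) = \mu + \alpha \mu^2\); in R's rnbinom parametrization the size argument corresponds to \(1/\alpha\). The values of \(\mu\) and \(\alpha\) below are illustrative:

```r
set.seed(4)
mu <- 100
alpha <- 0.1  ## dispersion
k <- rnbinom(1e5, mu = mu, size = 1 / alpha)
mean(k)  ## approximately 100
var(k)   ## approximately 100 + 0.1 * 100^2 = 1100, well above the Poisson's 100
```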
Why bother modeling raw counts, rather than dividing out the sequencing depth and working with the normalized counts? In other words, why put the \(s_{ij}\) on the right side of the equation above, rather than dividing it out on the left side and modeling \(K_{ij} / s_{ij}\)? The reason is that, with the raw count, we have knowledge about the link between the expected value and its variance. So we prefer the first equation below to the second equation, because with the first equation, we have some additional information about the variance of the quantity on the left hand side.
\[ K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} ) \]
\[ \frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij}) \]
When we sample cDNA fragments from a pool in a sequencing library, we can model the count of cDNA fragments which originated from a given gene with a binomial distribution, with a certain probability of picking a fragment for that gene which relates to factors such as the expression of that gene (the abundance of mRNA in the original population of cells), its length and technical factors in the production of the library. When we have many genes, the rate for each gene is low, and the total number of fragments is high, we know that the Poisson is a good approximation to the binomial. And for both the binomial and the Poisson, there is an explicit link between an observed count and its expected variance.
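This approximation can be checked directly in base R. The library size and per-gene probability below are made-up numbers chosen so that the expected count is 100; a quick sketch:

```r
set.seed(1)
n <- 50000        # number of simulated libraries
size <- 1e6       # total fragments per library (hypothetical)
p <- 1e-4         # probability a fragment comes from our gene (hypothetical)
binom <- rbinom(n, size=size, prob=p)  # exact binomial counts
pois  <- rpois(n, lambda=size*p)       # Poisson approximation, lambda = size*p
# the two distributions have nearly identical means and variances,
# and for the Poisson the variance equals the mean
c(mean(binom), mean(pois))
c(var(binom), var(pois))
```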
Below is an example of what happens when we divide or multiply a raw count. Here we show three distributions which all have the expected value of 100, although they have different variances. The first is a raw count with mean 100, the second and third are raw counts with mean 1000 and 10, which were then scaled by 1/10 and 10, respectively.
mypar(3,1)
n <- 10000
brks <- 0:300
hist(rpois(n,100),main="",xlab="",breaks=brks,col="black")
hist(rpois(n,1000)/10,main="",xlab="",breaks=brks,col="black")
hist(rpois(n,10)*10,main="",xlab="",breaks=brks,col="black")
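The difference in spread between the three histograms can be quantified directly: all three distributions have mean 100, but scaling a Poisson count by a factor changes its variance by the square of that factor. A quick base-R check:

```r
set.seed(1)
n <- 10000
x1 <- rpois(n, 100)        # variance ~ 100 (equal to the mean)
x2 <- rpois(n, 1000)/10    # variance ~ 1000 / 10^2 = 10
x3 <- rpois(n, 10)*10      # variance ~ 10 * 10^2 = 1000
round(c(mean(x1), mean(x2), mean(x3)))  # all near 100
round(c(var(x1), var(x2), var(x3)))     # near 100, 10, 1000
```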
So, when we scale a raw count, we break the implicit link between the mean and the variance. This is not necessarily a problem if we have hundreds of samples over which to observe within-group variance; however, RNA-seq experiments often have only 3 samples per group, in which case we gain information by using raw counts and incorporating the normalization factors on the right side of the equation above.
For the negative binomial, the variance parameter is called the dispersion, and it links the mean value with the expected variance. The reason we see more dispersion than in a Poisson is mostly due to changes in the proportions of genes across biological replicates, which we would expect due to natural differences in gene expression.
mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,lambda=100),
main="Poisson / NB, disp=0",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.01),
main="NB, disp = 0.01",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.1),
main="NB, disp = 0.1",xlab="",breaks=brks,col="black")
The square root of the dispersion is the coefficient of variation – SD/mean – after subtracting the variance we expect due to Poisson sampling.
disp <- 0.5
mu <- 100
v <- mu + disp * mu^2
sqrt(v)/mu
## [1] 0.7141
sqrt(v - mu)/mu
## [1] 0.7071
sqrt(disp)
## [1] 0.7071
A number of methods for assessing differential gene expression from RNA-seq counts use the negative binomial distribution to make probabilistic statements about the differences seen in an experiment. A few such methods are edgeR, DESeq2, and DSS. Other methods, such as limma+voom, find other ways to explicitly model the mean of log counts and the observed variance of log counts. A very incomplete list of statistical methods for RNA-seq differential expression is provided in the footnotes.
DESeq2 performs a similar step to limma as discussed in PH525x Course 3, in using the variance of all the genes to improve the variance estimate for each individual gene. In addition, DESeq2 shrinks the unreliable fold changes from genes with low counts, which will be seen in the resulting MA-plot.
Remember, we had created the DESeqDataSet object earlier using the following line of code (or alternatively using DESeqDataSetFromMatrix)
dds <- DESeqDataSet(airway, design= ~ cell + dex)
First, we set up the design of the experiment, so that
differences will be considered across the cell line and treatment variables. We
can read and, if necessary, reset the design using the following code.
design(dds)
## ~cell + dex
design(dds) <- ~ cell + dex
The last variable in the design is used by default for building
results tables (although arguments to results can be used
to customize the results table). We make sure the “control” or
“untreated” level is the first level, so that log fold changes will be
treated over control, and not control over treated.
levels(dds$dex)
## [1] "trt" "untrt"
dds$dex <- relevel(dds$dex, "untrt")
levels(dds$dex)
## [1] "untrt" "trt"
The following line runs the DESeq2 model. After this step, we can build a results table, which by default will compare the levels in the last variable in the design, so the dex treatment in our case:
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
head(res)
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.602170 -0.3812539 0.100654 -3.787751 0.000152017
## ENSG00000000005 0.000000 NA NA NA NA
## ENSG00000000419 520.297901 0.2068127 0.112219 1.842944 0.065337210
## ENSG00000000457 237.163037 0.0379206 0.143445 0.264357 0.791504963
## ENSG00000000460 57.932633 -0.0881677 0.287142 -0.307053 0.758803336
## ENSG00000000938 0.318098 -1.3782340 3.499875 -0.393795 0.693732273
## padj
## <numeric>
## ENSG00000000003 0.00128293
## ENSG00000000005 NA
## ENSG00000000419 0.19646960
## ENSG00000000457 0.91141814
## ENSG00000000460 0.89500645
## ENSG00000000938 NA
table(res$padj < 0.1)
##
## FALSE TRUE
## 13198 4820
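The padj column holds Benjamini-Hochberg adjusted p-values; the adjustment itself can be reproduced with base R's p.adjust. A toy example (not the airway results; note that DESeq2 additionally applies independent filtering, so its padj is not exactly plain BH on all genes):

```r
pvals <- c(0.001, 0.01, 0.02, 0.5, 0.8)    # made-up p-values
padj <- p.adjust(pvals, method="BH")       # Benjamini-Hochberg FDR adjustment
round(padj, 4)
table(padj < 0.1)   # how many pass a 10% FDR threshold
```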
A summary of the results can be generated:
summary(res)
##
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2604, 7.8%
## LFC < 0 (down) : 2216, 6.6%
## outliers [1] : 0, 0%
## low counts [2] : 15451, 46%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
For testing at a different threshold, we provide the
alpha argument to results, so that the mean filtering is
optimal for our new FDR threshold.
res2 <- results(dds, alpha=0.05)
table(res2$padj < 0.05)
##
## FALSE TRUE
## 12754 4028
The MA-plot provides a global view of the differential genes, with the log2 fold change on the y-axis over the mean of normalized counts:
plotMA(res, ylim=c(-4,4))
We can also test against a different null hypothesis. For example, to test for genes with a fold change more than doubling or less than halving:
res.thr <- results(dds, lfcThreshold=1)
plotMA(res.thr, ylim=c(-4,4))
A p-value histogram:
hist(res$pvalue[res$baseMean > 1],
col="grey", border="white", xlab="", ylab="", main="")
A sorted results table:
resSort <- res[order(res$padj),]
head(resSort)
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000152583 997.440 4.57492 0.184056 24.8561 2.22232e-136
## ENSG00000165995 495.093 3.29106 0.133174 24.7125 7.83976e-135
## ENSG00000120129 3409.029 2.94781 0.121438 24.2743 3.66732e-130
## ENSG00000101347 12703.387 3.76700 0.155438 24.2347 9.58234e-130
## ENSG00000189221 2341.767 3.35358 0.141782 23.6530 1.09937e-123
## ENSG00000211445 12285.615 3.73040 0.165831 22.4953 4.61814e-112
## padj
## <numeric>
## ENSG00000152583 4.00418e-132
## ENSG00000165995 7.06284e-131
## ENSG00000120129 2.20259e-126
## ENSG00000101347 4.31636e-126
## ENSG00000189221 3.96169e-120
## ENSG00000211445 1.38683e-108
Examine the counts for the top gene, sorting by p-value:
plotCounts(dds, gene=which.min(res$padj), intgroup="dex")
A more sophisticated plot of counts:
library(ggplot2)
data <- plotCounts(dds, gene=which.min(res$padj), intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, col=cell)) +
geom_point(position=position_jitter(width=.1,height=0)) +
scale_y_log10()
Connecting the points with lines shows the differences that are actually being
tested by results, given that our design includes
cell + dex:
ggplot(data, aes(x=dex, y=count, col=cell, group=cell)) +
geom_point() + geom_line() + scale_y_log10()
A heatmap of the top genes:
library(pheatmap)
topgenes <- head(rownames(resSort),20)
mat <- assay(rld)[topgenes,]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[,c("dex","cell")])
pheatmap(mat, annotation_col=df)
We can then check the annotation of these highly significant genes.
The contrast argument allows users to specify which
results table should be built. See the help and examples in
?results for more details.
If we suppose that we didn’t know about the different cell lines in the experiment, but noticed some structure in the counts, we could use surrogate variable analysis (SVA) to detect this hidden structure (see PH525x Course 3 for details on the algorithm).
Do the surrogate variables capture the cell difference?
Using the surrogate variables in a DESeq2 analysis:
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.7.8
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Mexico_City
## tzcode source: internal
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] pheatmap_1.0.12 vsn_3.68.0
## [3] DESeq2_1.40.2 Rsubread_2.14.2
## [5] GenomicAlignments_1.36.0 GenomicFeatures_1.52.1
## [7] AnnotationDbi_1.62.2 Rsamtools_2.16.0
## [9] airway_1.20.0 parathyroidSE_1.38.0
## [11] SummarizedExperiment_1.30.2 MatrixGenerics_1.12.3
## [13] matrixStats_1.0.0 class_7.3-22
## [15] RColorBrewer_1.1-3 cummeRbund_2.42.0
## [17] Gviz_1.44.1 rtracklayer_1.60.1
## [19] fastcluster_1.2.3 reshape2_1.4.4
## [21] ggplot2_3.4.3 RSQLite_2.3.1
## [23] knitr_1.43 MASS_7.3-60
## [25] bumphunter_1.42.0 locfit_1.5-9.8
## [27] iterators_1.0.14 foreach_1.5.2
## [29] GenomicRanges_1.52.0 coloncancermeth_1.0
## [31] limma_3.56.2 genefilter_1.82.1
## [33] SpikeInSubset_1.40.1 affy_1.78.2
## [35] Biobase_2.60.0 rafalib_1.0.2
## [37] dplyr_1.1.2 downloader_0.4
## [39] Biostrings_2.68.1 GenomeInfoDb_1.36.1
## [41] XVector_0.40.0 IRanges_2.34.1
## [43] S4Vectors_0.38.1 BiocGenerics_0.46.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.0 BiocIO_1.10.0 bitops_1.0-7
## [4] filelock_1.0.2 tibble_3.2.1 preprocessCore_1.62.1
## [7] XML_3.99-0.14 rpart_4.1.19 lifecycle_1.0.3
## [10] lattice_0.21-8 ensembldb_2.24.0 backports_1.4.1
## [13] magrittr_2.0.3 Hmisc_5.1-0 sass_0.4.7
## [16] rmarkdown_2.24 jquerylib_0.1.4 yaml_2.3.7
## [19] doRNG_1.8.6 DBI_1.1.3 abind_1.4-5
## [22] zlibbioc_1.46.0 AnnotationFilter_1.24.0 biovizBase_1.48.0
## [25] RCurl_1.98-1.12 nnet_7.3-19 VariantAnnotation_1.46.0
## [28] rappdirs_0.3.3 GenomeInfoDbData_1.2.10 annotate_1.78.0
## [31] codetools_0.2-19 DelayedArray_0.26.7 xml2_1.3.5
## [34] tidyselect_1.2.0 farver_2.1.1 BiocFileCache_2.8.0
## [37] base64enc_0.1-3 jsonlite_1.8.7 Formula_1.2-5
## [40] survival_3.5-7 tools_4.3.0 progress_1.2.2
## [43] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3
## [46] xfun_0.40 withr_2.5.0 BiocManager_1.30.22
## [49] fastmap_1.1.1 latticeExtra_0.6-30 fansi_1.0.4
## [52] digest_0.6.33 R6_2.5.1 colorspace_2.1-0
## [55] jpeg_0.1-10 dichromat_2.0-0.1 biomaRt_2.56.1
## [58] hexbin_1.28.3 utf8_1.2.3 generics_0.1.3
## [61] data.table_1.14.8 prettyunits_1.1.1 httr_1.4.7
## [64] htmlwidgets_1.6.2 S4Arrays_1.0.5 pkgconfig_2.0.3
## [67] gtable_0.3.3 blob_1.2.4 htmltools_0.5.6
## [70] ProtGenerics_1.32.0 scales_1.2.1 png_0.1-8
## [73] rstudioapi_0.15.0 rjson_0.2.21 checkmate_2.2.0
## [76] curl_5.0.2 cachem_1.0.8 stringr_1.5.0
## [79] foreign_0.8-84 restfulr_0.0.15 pillar_1.9.0
## [82] vctrs_0.6.3 dbplyr_2.3.3 xtable_1.8-4
## [85] cluster_2.1.4 htmlTable_2.4.1 evaluate_0.21
## [88] cli_3.6.1 compiler_4.3.0 rlang_1.1.1
## [91] crayon_1.5.2 rngtools_1.5.2 labeling_0.4.2
## [94] interp_1.1-4 plyr_1.8.8 stringi_1.7.12
## [97] deldir_1.0-9 BiocParallel_1.34.2 munsell_0.5.0
## [100] lazyeval_0.2.2 Matrix_1.6-1 BSgenome_1.68.0
## [103] hms_1.1.3 bit64_4.0.5 KEGGREST_1.40.0
## [106] highr_0.10 memoise_2.0.1 affyio_1.70.0
## [109] bslib_0.5.1 bit_4.0.5
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B., “Mapping and quantifying mammalian transcriptomes by RNA-seq”, Nat Methods. 2008. http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html
John C. Marioni, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad, “RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays” Genome Res. 2008. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527709/
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L., “Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation”, Nature Biotechnology, 2010. http://www.nature.com/nbt/journal/v28/n5/full/nbt.1621.html
Frazee AC, Langmead B, Leek JT. “ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets”. BMC Bioinformatics 12:449 http://www.ncbi.nlm.nih.gov/pubmed/22087737
The following sections give just a few examples of the many RNA-seq differential expression software packages:
The following methods are available on Bioconductor:
Michael I Love, Simon Anders, Wolfgang Huber, “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2” Genome Biology 2014. http://genomebiology.com/2014/15/12/550
Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data” Bioinformatics 2010. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/
Hao Wu, Chi Wang, Zhijin Wu, “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data” Biostatistics 2013. http://biostatistics.oxfordjournals.org/content/14/2/232
Charity W Law, Yunshun Chen, Wei Shi and Gordon K Smyth, “voom: precision weights unlock linear model analysis tools for RNA-seq read counts”, Genome Biology. 2014. http://genomebiology.com/2014/15/2/R29
samr package on CRAN: Jun Li and Robert Tibshirani, “Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-seq data”, Stat Methods Med Res. 2013. http://smm.sagepub.com/content/22/5/519.short
Cuffdiff 2, with cummeRbund, the accompanying Bioconductor visualization package: Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L., “Differential analysis of gene regulation at transcript resolution with RNA-seq” Nat Biotechnol. 2013. http://www.ncbi.nlm.nih.gov/pubmed/23222703
Peter Glaus, Antti Honkela, and Magnus Rattray, “Identifying differentially expressed transcripts from RNA-seq data with biological variation”, Bioinformatics. 2012. http://bioinformatics.oxfordjournals.org/content/28/13/1721